[00:00:05] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210805T0000). [00:05:58] (03CR) 10Legoktm: "> I wasn't sure how manual DNS switchover was these days. I'd still love to have either this alias or some other way to simplify this for " [dns] - 10https://gerrit.wikimedia.org/r/708874 (owner: 10Thcipriani) [00:06:44] (03PS2) 10Legoktm: noc: Expose primary datacenter on conf/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 [00:18:03] (03PS2) 10Legoktm: Remove DynamicPageList from all Wikimania wikis except 2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709585 (https://phabricator.wikimedia.org/T287916) [00:20:55] (03CR) 10Legoktm: [C: 03+2] Remove DynamicPageList from all Wikimania wikis except 2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709585 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [00:21:39] (03Merged) 10jenkins-bot: Remove DynamicPageList from all Wikimania wikis except 2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709585 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [00:24:03] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Remove DynamicPageList from all Wikimania wikis except 2016 (T287916) (duration: 01m 52s) [00:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:12] T287916: Disable DPL on wikis that aren't using it - https://phabricator.wikimedia.org/T287916 [00:31:58] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:52] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:30] PROBLEM - Check systemd state on 
logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:12] (03CR) 10Jeena Huneidi: [C: 04-1] "Nice! The app is working for me. I left one comment about the secret that might be nice to change. Otherwise I think we could work on benc" [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [01:06:41] (03CR) 10Krinkle: "This seems a bit reaching into internals and duplicating information. The same info is also available at run-time here already, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 (owner: 10Legoktm) [01:08:28] (03CR) 10Krinkle: "Hm.. appears not. I thought db.php actually loaded WebStart and read this from etcd, but it doesn't." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 (owner: 10Legoktm) [01:10:03] (03CR) 10Krinkle: noc: Expose primary datacenter on conf/ (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 (owner: 10Legoktm) [01:15:32] (03PS1) 10Krinkle: noc: Fix warning on conf/index.php when testing locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710130 [01:16:56] (03PS2) 10Krinkle: noc: Fix warning on conf/index.php when testing locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710130 [01:26:52] !log krinkle@mwmaint1002 Temporarily grant myself `translationadmin` on wikimania2016wiki in order to approve an edit given FlaggedRevs-like nature of Translate [01:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:55] !log on mwmaint1002 killing populateEditCount.php for loginwiki -- it's slow but it's not going to find any edits [02:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 
1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:56:28] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:01:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:05:32] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:28:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:30:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:37:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:41:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:47:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[03:50:40] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:58:10] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:03:50] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:13:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:15:14] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:24:42] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:25:51] (03CR) 10Legoktm: [C: 03+1] noc: Fix warning on conf/index.php when testing locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710130 (owner: 10Krinkle) [04:26:34] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:34:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:39:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:47:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:51:16] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:59:22] (03CR) 10Marostegui: production-m5.sql.erb: Add toolhub grants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709877 (https://phabricator.wikimedia.org/T271480) (owner: 10Marostegui) [05:01:11] (03PS2) 10Marostegui: production-m5.sql.erb: Add toolhub grants [puppet] - 10https://gerrit.wikimedia.org/r/709877 (https://phabricator.wikimedia.org/T271480) [05:01:25] (03PS2) 10Marostegui: dbproxy1013,dbproxy1015: Promote db1183 to master [puppet] - 10https://gerrit.wikimedia.org/r/709673 (https://phabricator.wikimedia.org/T287852) [05:04:27] (03CR) 10Marostegui: "Apart from my comment inline, I would assume that to deploy 
this we just need to do the usual scap sync-file and deploy ProductionServices" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 (owner: 10Krinkle) [05:04:34] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:08:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:15:56] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:17:50] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:38:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:46:16] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:51:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:53:42] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:15:41] !log add back thanos-be1003 sdf1 in thanos-swift [06:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:33:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:37:02] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [06:40:12] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [06:40:45] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [06:43:17] (03PS1) 10Marostegui: db1183: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/710215 (https://phabricator.wikimedia.org/T287852) [06:44:01] (03CR) 10Marostegui: [C: 03+2] db1183: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/710215 (https://phabricator.wikimedia.org/T287852) (owner: 10Marostegui) [06:46:01] (03PS1) 10Marostegui: mariadb: Promote db1183 to m2 master. 
[puppet] - 10https://gerrit.wikimedia.org/r/710216 (https://phabricator.wikimedia.org/T287852) [06:46:49] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [06:51:23] 10Puppet, 10Infrastructure-Foundations, 10Wikidata, 10wdwb-tech: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (10Addshore) [06:52:51] (03CR) 10Physikerwelt: [C: 03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710126 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko) [06:54:36] !log prometheus/ops codfw +100G [06:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:25] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (10jcrespo) [06:59:52] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:01:19] 10Puppet, 10Infrastructure-Foundations, 10Wikidata, 10wdwb-tech: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (10Ladsgroup) It's a bit hard to implement this as systemd timers are not concurrent and the crons here are designed to be three at the same... [07:03:40] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1183 to m2 master. [puppet] - 10https://gerrit.wikimedia.org/r/710216 (https://phabricator.wikimedia.org/T287852) (owner: 10Marostegui) [07:03:58] (03CR) 10JMeybohm: [C: 04-1] "I think this is fine. 
But if you plan to ever enable debug ports on one of the clusters, you should reserve the nodeports (https://wikitec" [deployment-charts] - 10https://gerrit.wikimedia.org/r/710111 (https://phabricator.wikimedia.org/T255871) (owner: 10Ottomata) [07:04:13] 10Puppet, 10Infrastructure-Foundations, 10Wikidata, 10wdwb-tech: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (10Legoktm) >>! In T288175#7262223, @Ladsgroup wrote: > It's a bit hard to implement this as systemd timers are not concurrent and the crons... [07:05:24] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:06:50] 10Puppet, 10Infrastructure-Foundations, 10Wikidata, 10wdwb-tech: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (10Ladsgroup) This is going to get replaced with jobs soon (maybe in a couple of months) so I wouldn't put too much work in it. Having three... 
[07:10:08] (03CR) 10Marostegui: [C: 03+2] dbproxy1013,dbproxy1015: Promote db1183 to master [puppet] - 10https://gerrit.wikimedia.org/r/709673 (https://phabricator.wikimedia.org/T287852) (owner: 10Marostegui) [07:11:35] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [07:14:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:17:53] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [07:19:46] (03CR) 10Ladsgroup: Add shellbox-constraint services and use them (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [07:20:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:20:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Move parsercache DB config to *Services.php (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 (owner: 10Krinkle) [07:20:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Just a nit on variable naming, but you can ignore it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/703630 (owner: 10Krinkle) [07:21:10] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Move parsercache DB config to *Services.php (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703631 (owner: 10Krinkle) [07:22:37] (03PS4) 10Ladsgroup: Add shellbox-constraint services and use them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) [07:27:34] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:28:08] (03CR) 10Legoktm: [C: 03+1] Add shellbox-constraint services and use them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [07:31:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:37:33] (03CR) 10Marostegui: [C: 03+1] "I think this is fine, but I would like Joe and volans to also confirm" [puppet] - 10https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [07:39:00] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: filter deployable files by 'deploy-tag' [puppet] - 10https://gerrit.wikimedia.org/r/710007 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [07:39:08] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add /srv/alerts-thanos to rule alerts path [puppet] - 10https://gerrit.wikimedia.org/r/710010 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [07:39:20] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: add Thanos-specific alerts deploy [puppet] - 10https://gerrit.wikimedia.org/r/710009 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [07:39:24] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: refactor into ::prometheus [puppet] - 
10https://gerrit.wikimedia.org/r/710008 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [07:55:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:56:34] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:56:38] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:56:58] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:56:58] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:57:24] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:57:48] uh? 
[08:00:13] !log Failover m2 from db1107 to db1183 - T287852 [08:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:21] T287852: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 [08:03:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:03:16] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [08:05:21] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) Failover was done. Read-only times: Start: 08:00:29 AM UTC Stop: 08:00:47 AM UTC Total: 18 seconds [08:06:17] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [08:07:47] (03CR) 10Kormat: [V: 03+2 C: 03+2] xhgui: add dummy admin password [labs/private] - 10https://gerrit.wikimedia.org/r/672461 (https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [08:08:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:09:01] (03PS1) 10Marostegui: db1107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/710217 (https://phabricator.wikimedia.org/T287852) [08:09:03] (03CR) 10Kormat: xhgui: enable database access for admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621100 
(https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [08:10:46] (03CR) 10Marostegui: [C: 03+2] db1107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/710217 (https://phabricator.wikimedia.org/T287852) (owner: 10Marostegui) [08:11:46] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:11:48] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:11:53] (03PS2) 10David Caro: prometheus.icinga_exporter: Use per-label regexes on team labels [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709671 [08:11:55] (03CR) 10David Caro: prometheus.icinga_exporter: Use per-label regexes on team labels (034 comments) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709671 (owner: 10David Caro) [08:12:06] (03CR) 10David Caro: prometheus.icinga_exporter: Use per-label regexes on team labels (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709671 (owner: 10David Caro) [08:12:08] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:12:08] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:12:34] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:12:50] (03CR) 10David Caro: prometheus.icinga_exporter: Use per-label regexes on team labels (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709671 (owner: 10David Caro) [08:13:02] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - 
https://phabricator.wikimedia.org/T287852 (10Marostegui) [08:13:14] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) 05Open→03Resolved [08:14:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:19:07] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10ema) >>! In T287983#7261627, @Legoktm wrote: > I... [08:19:34] (03CR) 10Jelto: [C: 03+1] "lgtm and you are right the replica isn't intended for end user access at the moment. So I'm okay with just opening up GitLab and not the r" [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [08:23:47] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I added a comment on the phabricator task, my opposition to this change stands." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709991 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [08:25:32] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:26:32] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:26:47] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: enforce minimum 30s dashboard refresh [puppet] - 10https://gerrit.wikimedia.org/r/710058 (https://phabricator.wikimedia.org/T119719) (owner: 10Filippo Giunchedi) [08:28:23] !log bounce grafana to apply new settings - T119719 [08:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:30] T119719: Enforce a minimum refresh period for grafana dashboards hitting graphite - https://phabricator.wikimedia.org/T119719 [08:29:59] 10SRE, 10observability, 10Graphite, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Enforce a minimum refresh period for grafana dashboards hitting graphite - https://phabricator.wikimedia.org/T119719 (10fgiunchedi) 05Open→03Resolved Complete, minimum `30s` refresh rate now. Simple enough t... 
[08:33:55] (03PS1) 10JMeybohm: docker_registry_ha: Increase nginx tmpfs size from 1gb to 2gb [puppet] - 10https://gerrit.wikimedia.org/r/710218 (https://phabricator.wikimedia.org/T288198) [08:35:34] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30484/console" [puppet] - 10https://gerrit.wikimedia.org/r/710218 (https://phabricator.wikimedia.org/T288198) (owner: 10JMeybohm) [08:36:14] (03CR) 10JMeybohm: docker_registry_ha: Increase nginx tmpfs size from 1gb to 2gb [puppet] - 10https://gerrit.wikimedia.org/r/710218 (https://phabricator.wikimedia.org/T288198) (owner: 10JMeybohm) [08:41:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:42:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:44:36] (03CR) 10Volans: "What is the order of steps for deploying this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [08:45:12] (03CR) 10Giuseppe Lavagetto: [C: 03+1] docker_registry_ha: Increase nginx tmpfs size from 1gb to 2gb [puppet] - 10https://gerrit.wikimedia.org/r/710218 (https://phabricator.wikimedia.org/T288198) (owner: 10JMeybohm) [08:46:52] (03CR) 10RhinosF1: [C: 04-1] Conftool-sections: farewell s10 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708631 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [08:52:01] 10SRE, 10Traffic: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10ema) p:05Triage→03Medium [08:53:45] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:53:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nice work!" 
[debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709671 (owner: 10David Caro) [08:54:47] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:55:13] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/710114 (https://phabricator.wikimedia.org/T285804) (owner: 10Legoktm) [08:59:18] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/710121 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [09:00:33] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga::ircbot: Send database notifications to #wikimedia-data-persistence [puppet] - 10https://gerrit.wikimedia.org/r/710002 (https://phabricator.wikimedia.org/T283580) (owner: 10LSobanski) [09:03:45] (03CR) 10LSobanski: [C: 03+2] icinga::ircbot: Send database notifications to #wikimedia-data-persistence [puppet] - 10https://gerrit.wikimedia.org/r/710002 (https://phabricator.wikimedia.org/T283580) (owner: 10LSobanski) [09:04:19] 10SRE, 10Traffic: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10ema) [09:05:20] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:21] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10MSantos) @Legoktm from #product-infrastructure-team-backlog which are the official maintainers of maps, this looks great.... 
[09:06:43] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Fuzzy) I stumbled upon a similar problem, with the Israeli [[ https://he.wikisource.org/wiki/תקנות_הגנת_הצומח_(יבוא_צמחים,_מוצרי_צמחים,_נגעים_ואמ... [09:10:41] (03CR) 10JMeybohm: [C: 03+2] docker_registry_ha: Increase nginx tmpfs size from 1gb to 2gb [puppet] - 10https://gerrit.wikimedia.org/r/710218 (https://phabricator.wikimedia.org/T288198) (owner: 10JMeybohm) [09:12:10] sobanski: okay to merge your change? [09:12:29] icinga::ircbot: Send database notifications to #wikimedia-data-persistence (064a5a8514) - that is [09:13:25] (03PS6) 10Hnowlan: maps: make maps1008 a buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582) [09:14:05] jayme: yup, thanks [09:14:16] done [09:14:53] (03CR) 10Hnowlan: [C: 03+2] maps: make maps1008 a buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [09:19:17] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=registry2004.codfw.wmnet,dc=codfw,cluster=docker-registry [09:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:54] (03CR) 10Vgutierrez: [C: 03+1] configmaster: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/709668 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:28:46] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[09:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:17] (03CR) 10MMandere: [C: 03+2] configmaster: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/709668 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:32:46] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org [09:33:46] (03PS1) 10Ladsgroup: Route Shellbox requests to 'constraint-regex-checker' service [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710092 (https://phabricator.wikimedia.org/T176312) [09:34:07] (03PS1) 10Ladsgroup: Route Shellbox requests to 'constraint-regex-checker' service [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710093 (https://phabricator.wikimedia.org/T176312) [09:34:17] (03CR) 10Ladsgroup: [C: 03+2] Route Shellbox requests to 'constraint-regex-checker' service [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710092 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [09:34:23] (03CR) 10Ladsgroup: [C: 03+2] Route Shellbox requests to 'constraint-regex-checker' service [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710093 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [09:34:45] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1008.eqiad.wmnet with reason: REIMAGE [09:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:59] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1008.eqiad.wmnet with reason: REIMAGE [09:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:18] (03PS1) 10Ema: cache: single backend experiment [puppet] - 10https://gerrit.wikimedia.org/r/710224 (https://phabricator.wikimedia.org/T288106) 
[09:37:41] (03PS5) 10Ladsgroup: Add shellbox-constraint services and use them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) [09:37:46] (Traffic bill over quota) firing: (2) Traffic bill over quota - https://alerts.wikimedia.org [09:39:22] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/710224 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [09:40:38] 10SRE: A few hosts on production with software raid (md) have partitions in resync=PENDING status - https://phabricator.wikimedia.org/T288212 (10jcrespo) [09:49:07] 10SRE, 10vm-requests: codfw: 1 VM request for Dragonfly supernode - https://phabricator.wikimedia.org/T288216 (10JMeybohm) [09:49:17] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:32] (03PS1) 10Elukey: Improve the kubeflow-kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/710226 (https://phabricator.wikimedia.org/T272919) [09:50:04] (03CR) 10jerkins-bot: [V: 04-1] Improve the kubeflow-kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/710226 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [09:52:46] (Traffic bill over quota) firing: (2) Traffic bill over quota - https://alerts.wikimedia.org [09:55:01] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:55:57] (03Merged) 10jenkins-bot: Route Shellbox requests to 'constraint-regex-checker' service [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710092 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [09:56:00] (03Merged) 10jenkins-bot: Route Shellbox requests to 
'constraint-regex-checker' service [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710093 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [09:56:03] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:56:32] !log jayme@cumin1001 START - Cookbook sre.ganeti.makevm for new host dragonfly-supernode2001.codfw.wmnet [09:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:26] 10SRE: A few hosts on production with software raid (md) have partitions in resync=PENDING status - https://phabricator.wikimedia.org/T288212 (10Kormat) https://unix.stackexchange.com/a/101190/358110 My reading of this is that this is a normal state if the given array hasn't been written to. Looking at puppetma... [09:57:46] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org [09:59:29] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php: Backport: [[gerrit:710092|Route Shellbox requests to 'constraint-regex-checker' service (T176312)]] (duration: 01m 27s) [09:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:36] T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions) - https://phabricator.wikimedia.org/T176312 [10:00:04] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210805T1000). 
[10:00:08] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [10:01:06] (03PS6) 10Ladsgroup: Add shellbox-constraint services and use them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) [10:01:18] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php: Backport: [[gerrit:710093|Route Shellbox requests to 'constraint-regex-checker' service (T176312)]] (duration: 01m 06s) [10:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:36] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (10cmooney) [10:02:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (10cmooney) [10:03:00] (03CR) 10Ladsgroup: [C: 03+2] Add shellbox-constraint services and use them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [10:03:41] (03Merged) 10jenkins-bot: Add shellbox-constraint services and use them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [10:03:43] !log Reconfiguring packet buffer partitioning on cloudsw-c8-eqiad T288036 [10:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:49] T288036: Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 [10:04:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:13] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dragonfly-supernode2001.codfw.wmnet [10:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:19] 10SRE: A few hosts on production with software raid (md) have partitions in resync=PENDING status - https://phabricator.wikimedia.org/T288212 (10jcrespo) 05Open→03Resolved a:03Kormat Based on comments and survey I received, this is most probably a normal state, that will auto-correct on first write (all PE... [10:11:37] !log restart acme-chief on acmechief1001 [10:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:39] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:01] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:15:19] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:17:06] (03PS1) 10Ladsgroup: Add 'constraint-regex-checker' to isEnabled() check as well [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710094 (https://phabricator.wikimedia.org/T176312) [10:17:35] (03PS1) 10Ladsgroup: Add 'constraint-regex-checker' to isEnabled() check as well [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710095 (https://phabricator.wikimedia.org/T176312) [10:19:00] (03PS1) 10Giuseppe Lavagetto: namespaces: allow defining the tiller resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/710228 [10:21:33] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:22:14] (03PS1) 10Hnowlan: maps: reimage maps2006 as a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/710231 (https://phabricator.wikimedia.org/T269582) [10:22:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:23:07] !log ladsgroup@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:709821|Add shellbox-constraint services and use them (T176312)]], Part I (duration: 01m 07s) [10:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:15] T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions) - https://phabricator.wikimedia.org/T176312 [10:23:29] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:24:29] !log ladsgroup@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:709821|Add shellbox-constraint services and use them (T176312)]], Part II (duration: 01m 07s) [10:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:53] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:25:38] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:709821|Add shellbox-constraint services and use them (T176312)]], Part III (duration: 01m 06s) [10:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline but not blocking" [puppet] - 
10https://gerrit.wikimedia.org/r/709053 (owner: 10David Caro) [10:30:02] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [10:30:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (10cmooney) 05Open→03Resolved [10:31:59] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:55] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:28] (03CR) 10Legoktm: [C: 03+2] mediawiki: Ignore php-fpm when stopping cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/710114 (https://phabricator.wikimedia.org/T285804) (owner: 10Legoktm) [10:44:16] (03PS1) 10Hnowlan: maps: reimage maps2005 as buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/710234 (https://phabricator.wikimedia.org/T269582) [10:48:11] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:48:13] (03PS1) 10Legoktm: sre.switchdc.services: Exclude helm-charts, lacking a service IP [cookbooks] - 10https://gerrit.wikimedia.org/r/710235 (https://phabricator.wikimedia.org/T285707) [10:49:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:50:03] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:50:30] (03PS1) 10Ema: cache: refactor dynamic_backend_caches logic [puppet] - 10https://gerrit.wikimedia.org/r/710236 (https://phabricator.wikimedia.org/T288106) [10:51:10] (03CR) 10jerkins-bot: [V: 04-1] sre.switchdc.services: Exclude helm-charts, lacking a service IP [cookbooks] - 10https://gerrit.wikimedia.org/r/710235 (https://phabricator.wikimedia.org/T285707) (owner: 10Legoktm) [10:51:37] (03Merged) 10jenkins-bot: mediawiki: Ignore php-fpm when stopping cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/710114 (https://phabricator.wikimedia.org/T285804) (owner: 10Legoktm) [10:52:06] (03CR) 10Legoktm: "CI failure seems unrelated." [cookbooks] - 10https://gerrit.wikimedia.org/r/710235 (https://phabricator.wikimedia.org/T285707) (owner: 10Legoktm) [10:52:58] 10SRE, 10Datacenter-Switchover, 10Patch-For-Review: switchdc check on mwmaint for running PHP processes should ignore php-fpm processes - https://phabricator.wikimedia.org/T285804 (10Legoktm) 05Open→03Resolved Will be included in the next spicerack release. 
[10:53:22] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30485/console" [puppet] - 10https://gerrit.wikimedia.org/r/710236 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [11:00:04] Amir1, Lucas_WMDE, and apergos: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for EU Backport and Config trainingYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210805T1100). [11:00:04] dcausse and MatmaRex: A patch you scheduled for EU Backport and Config trainingYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:29] here. [11:00:34] no trainees today. [11:01:27] dcausse your patches look discrete and plain enough. MatmaRex, do yours implicate i18n? if so, maybe they need to wait for the train? I'd like someone who understands that better to weigh in . [11:01:32] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [11:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:43] also, are you both self-serve or does either of you need someone to deploy? [11:02:28] also, I prefer not to be the only deployer for the window, Lucas_WMDE are you around? 
[11:02:41] theoretically here but making lunch [11:02:44] but I can keep an eye on IRC ^^ [11:02:49] sounds great [11:03:03] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:06:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:51] !log Reconfiguring packet buffer partitioning on cloudsw-d5-eqiad T288037 [11:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:06] T288037: Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 [11:11:56] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2006.codfw.wmnet [11:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:39] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/710238 (owner: 10L10n-bot) [11:14:25] (03PS2) 10Legoktm: varnish: Improve comments around maps access, retire T261694 [puppet] - 10https://gerrit.wikimedia.org/r/709511 (https://phabricator.wikimedia.org/T261694) [11:15:47] (03CR) 10Legoktm: [C: 03+2] varnish: Improve comments around maps access, retire T261694 [puppet] - 10https://gerrit.wikimedia.org/r/709511 (https://phabricator.wikimedia.org/T261694) (owner: 10Legoktm) [11:19:22] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Legoktm) >>! In T261694#7262470, @MSantos wrote: > @Legoktm from #product-infrastructure-team-backlog which are the offici... 
[11:19:46] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (10Legoktm) [11:19:56] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Legoktm) 05Open→03Resolved [11:20:36] (03CR) 10David Caro: [V: 03+2 C: 03+2] prometheus.icinga_exporter: Use per-label regexes on team labels [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709671 (owner: 10David Caro) [11:23:02] (03PS3) 10Hnowlan: maps: reimage maps1010 as buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702619 (https://phabricator.wikimedia.org/T269582) [11:23:31] (03PS2) 10Ema: cache: refactor dynamic_backend_caches logic [puppet] - 10https://gerrit.wikimedia.org/r/710236 (https://phabricator.wikimedia.org/T288106) [11:23:33] (03PS2) 10Ema: cache: single backend experiment [puppet] - 10https://gerrit.wikimedia.org/r/710224 (https://phabricator.wikimedia.org/T288106) [11:23:50] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [11:24:26] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (10cmooney) 05Open→03Resolved [11:25:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:26:02] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [11:30:32] we are halfway through the window. dcausse? are you planning to deploy? 
[11:30:53] I would ping MatmaRex but they don't seem to be in here [11:31:06] Lucas_WMDE: a fyi ^^ [11:33:50] ack [11:33:57] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:35:05] (03PS3) 10Ema: cache: single backend experiment [puppet] - 10https://gerrit.wikimedia.org/r/710224 (https://phabricator.wikimedia.org/T288106) [11:35:08] (03PS1) 10Ema: cache: enable single backend experiment on cp4027 [puppet] - 10https://gerrit.wikimedia.org/r/710244 (https://phabricator.wikimedia.org/T288106) [11:35:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:37:19] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:38:53] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps2008.codfw.wmnet with reason: Rebuilding as buster replica of maps2009 [11:38:54] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps2008.codfw.wmnet with reason: Rebuilding as buster replica of maps2009 [11:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:05] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps2006.codfw.wmnet with reason: Rebuilding as buster replica of maps2009 [11:39:07] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps2006.codfw.wmnet with reason: Rebuilding as buster replica of maps2009 [11:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:19] !log removing maps2006 from old maps cassandra cluster [11:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:17] !log prepare cloudsw1-c8-eqiad for cloudsw2-c8 - T277340 [11:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:25] T277340: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 [11:48:47] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:50:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:50:46] apergos: I have to leave now, FYI [11:52:49] well there's only 8 minutes [11:53:01] so I'm going to say the window is closed, 8minutes is not time to merge deploy and sit and watch\ [11:53:10] see you later Lucas_WMDE [11:55:52] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1008.eqiad.wmnet [11:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:23] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10ayounsi) Thanks, I got the initial configuration done. Left to do for `C8`: * document cables details (mgmt + DACs) in Netbox - @Jclark-ctr * Upgrade Junos - Netops * Put speci... [11:56:49] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:58:28] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1010.eqiad.wmnet [11:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:43] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:59:09] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1010.eqiad.wmnet with reason: Rebuilding as buster replica of maps1009 [11:59:10] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1010.eqiad.wmnet with reason: Rebuilding as buster replica of maps1009 [11:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:49] PROBLEM - Host cloudvirt1038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:10:43] damn sorry completely missed the deploy backport&config window, sorry for the confusion [12:14:39] (03CR) 10JMeybohm: [C: 03+1] namespaces: allow defining the tiller resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/710228 (owner: 10Giuseppe Lavagetto) [12:14:39] welp, next window for you :-D [12:17:19] (03CR) 10JMeybohm: [C: 03+2] dragonfly: Enable metric scraping for dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/709703 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [12:19:47] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:21:30] (03CR) 10JMeybohm: [C: 03+1] sre.switchdc.services: Exclude helm-charts, lacking a service IP [cookbooks] - 10https://gerrit.wikimedia.org/r/710235 (https://phabricator.wikimedia.org/T285707) (owner: 10Legoktm) [12:23:39] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:28:12] (03PS2) 10Majavah: metricsinfra: Add config management server [puppet] - 10https://gerrit.wikimedia.org/r/710068 (https://phabricator.wikimedia.org/T286299) [12:28:35] (03PS1) 10JMeybohm: site/install_server: Add dragonfly-supernode2001 to DHCP and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/710247 (https://phabricator.wikimedia.org/T288216) [12:29:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:29:38] (03CR) 10JMeybohm: [C: 03+2] site/install_server: Add dragonfly-supernode2001 to DHCP and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/710247 (https://phabricator.wikimedia.org/T288216) (owner: 10JMeybohm) [12:31:49] 10SRE, 10vm-requests: eqiad: 1 VM request for Dragonfly supernode - https://phabricator.wikimedia.org/T286057 (10JMeybohm) FTR: I've reduced the memory of this instance to 2GB [12:32:03] 10SRE, 10vm-requests, 10Patch-For-Review: codfw: 1 VM request for Dragonfly supernode - https://phabricator.wikimedia.org/T288216 (10JMeybohm) 05Open→03Resolved [12:33:15] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:35:52] (03PS1) 10Filippo Giunchedi: sre: add alerting cluster puppet fail [alerts] - 10https://gerrit.wikimedia.org/r/710248 (https://phabricator.wikimedia.org/T283151) [12:36:43] PROBLEM - Router interfaces on cr2-esams 
is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:37:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] namespaces: allow defining the tiller resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/710228 (owner: 10Giuseppe Lavagetto) [12:40:16] (03Merged) 10jenkins-bot: namespaces: allow defining the tiller resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/710228 (owner: 10Giuseppe Lavagetto) [12:40:33] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:42:49] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:44:06] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:44:06] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:13] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:42] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:39] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:52:34] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:52:35] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[12:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:02] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:14] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30486/console" [puppet] - 10https://gerrit.wikimedia.org/r/710224 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [12:59:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:00:07] (03CR) 10Ema: [C: 03+2] cache: refactor dynamic_backend_caches logic [puppet] - 10https://gerrit.wikimedia.org/r/710236 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [13:01:38] (03PS4) 10David Caro: prometheus.icinga_exporter: Add label_teams_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/709053 [13:01:39] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:01:45] (03CR) 10David Caro: prometheus.icinga_exporter: Add label_teams_config parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709053 (owner: 10David Caro) [13:01:59] (03PS6) 10David Caro: profile.icinga_exporter: Added label_teams_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/709054 [13:02:05] (03PS4) 10Ema: cache: single backend experiment [puppet] - 10https://gerrit.wikimedia.org/r/710224 (https://phabricator.wikimedia.org/T288106) [13:02:07] (03PS3) 10David Caro: prometheus: added some wmcs team label configs 
[puppet] - 10https://gerrit.wikimedia.org/r/709471 [13:02:36] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: Move image name and version under main_app [deployment-charts] - 10https://gerrit.wikimedia.org/r/708974 (https://phabricator.wikimedia.org/T287374) (owner: 10DCausse) [13:02:49] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/709471 (owner: 10David Caro) [13:03:06] (03CR) 10Filippo Giunchedi: [C: 03+1] profile.icinga_exporter: Added label_teams_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/709054 (owner: 10David Caro) [13:03:46] (03CR) 10Ladsgroup: [C: 03+2] Add 'constraint-regex-checker' to isEnabled() check as well [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710095 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [13:03:50] (03CR) 10Ladsgroup: [C: 03+2] Add 'constraint-regex-checker' to isEnabled() check as well [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710094 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [13:04:19] (03CR) 10David Caro: prometheus: added some wmcs team label configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709471 (owner: 10David Caro) [13:04:36] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (untested)" [puppet] - 10https://gerrit.wikimedia.org/r/709053 (owner: 10David Caro) [13:04:55] (03CR) 10Ema: [C: 03+2] cache: single backend experiment [puppet] - 10https://gerrit.wikimedia.org/r/710224 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [13:05:25] (03Merged) 10jenkins-bot: flink-session-cluster: Move image name and version under main_app [deployment-charts] - 10https://gerrit.wikimedia.org/r/708974 (https://phabricator.wikimedia.org/T287374) (owner: 10DCausse) [13:08:09] (03PS4) 10David Caro: prometheus: added some wmcs team label configs and default sre [puppet] - 10https://gerrit.wikimedia.org/r/709471 
[13:08:31] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:38] (03CR) 10David Caro: prometheus: added some wmcs team label configs and default sre (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709471 (owner: 10David Caro) [13:08:54] (03PS2) 10DCausse: rdf-streaming-updater: Cleanup image tags under docker [deployment-charts] - 10https://gerrit.wikimedia.org/r/708975 (https://phabricator.wikimedia.org/T287374) [13:11:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:12:28] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:22:49] (03Merged) 10jenkins-bot: Add 'constraint-regex-checker' to isEnabled() check as well [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710095 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [13:22:52] (03Merged) 10jenkins-bot: Add 'constraint-regex-checker' to isEnabled() check as well [extensions/WikibaseQualityConstraints] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710094 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [13:25:11] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php: Backport: [[gerrit:710095|Add 'constraint-regex-checker' to 
isEnabled() check as well (T176312)]] (duration: 01m 19s) [13:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:19] T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions) - https://phabricator.wikimedia.org/T176312 [13:26:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:27:58] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php: Backport: [[gerrit:710094|Add 'constraint-regex-checker' to isEnabled() check as well (T176312)]] (duration: 01m 06s) [13:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:28:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:45] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:34:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:35:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) @Jclark-ctr ACK! 
though in T280203 we have decom'ed about 20 servers in A that are completely out of production yet still have to be removed from the rack. C... [13:35:21] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:33] 10SRE, 10serviceops, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) 05Open→03Stalled this is only open due to a single remaining server, the mwmaint servers in codfw. this will be upgraded after we switch D... [13:37:53] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [13:38:05] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:38:44] (03PS1) 10Giuseppe Lavagetto: mwdebug: bump tiller resources in codfw as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/710260 [13:38:53] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:43] !log deleted reserved (not active) IP 103.102.166.5/28 from netbox (T284246) [13:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:51] T284246: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 [13:42:17] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:43:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: bump tiller resources in codfw as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/710260 (owner: 10Giuseppe Lavagetto) [13:43:11] (03PS1) 10JMeybohm: dragonfly: Switch codfw peers to codfw supernode [puppet] - 10https://gerrit.wikimedia.org/r/710261 (https://phabricator.wikimedia.org/T286054) [13:43:55] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh5002.wikimedia.org [13:43:55] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh5002.wikimedia.org [13:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:11] (03CR) 10JMeybohm: [C: 03+2] dragonfly: Switch codfw peers to codfw supernode [puppet] - 10https://gerrit.wikimedia.org/r/710261 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [13:45:25] (03Merged) 10jenkins-bot: mwdebug: bump tiller resources in codfw as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/710260 (owner: 10Giuseppe Lavagetto) [13:46:31] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:33] (03PS1) 10Ssingh: wikidough: refactor and move the landing page to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/710262 
[13:47:03] (03CR) 10JMeybohm: prometheus::ops: Scrape metrics from dfdaemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709704 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [13:48:01] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:48:01] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:08] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:03] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:44] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10Papaul) Your replacement part associated with RMA R200361905 Item # 100 has been successfully shipped. Details of which are provided below. Replacement Serial Number: R... 
[13:50:01] (03CR) 10Hnowlan: [C: 03+2] maps: reimage maps2006 as a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/710231 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [13:51:36] (03PS1) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/710263 [13:53:05] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:09] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:36] (03PS4) 10Hnowlan: maps: reimage maps1010 as buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702619 (https://phabricator.wikimedia.org/T269582) [13:58:15] (03CR) 10Hnowlan: [C: 03+2] maps: reimage maps1010 as buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702619 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [14:00:59] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:45] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:08:05] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:12] (03PS1) 10Marostegui: mariadb: Productionize dbproxy2004 [puppet] - 10https://gerrit.wikimedia.org/r/710266 (https://phabricator.wikimedia.org/T288093) [14:10:19] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2006.codfw.wmnet with reason: REIMAGE [14:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:41] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:12:32] (03PS2) 10Marostegui: mariadb: Productionize dbproxy2004 [puppet] - 10https://gerrit.wikimedia.org/r/710266 (https://phabricator.wikimedia.org/T288093) [14:12:47] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2006.codfw.wmnet with reason: REIMAGE [14:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:36] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy2004 [puppet] - 10https://gerrit.wikimedia.org/r/710266 (https://phabricator.wikimedia.org/T288093) (owner: 10Marostegui) [14:14:00] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host doh5002.wikimedia.org [14:14:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:53] (03PS1) 10Marostegui: production-m5.sql: Add codfw proxy user [puppet] - 10https://gerrit.wikimedia.org/r/710267 (https://phabricator.wikimedia.org/T288093) [14:16:00] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1010.eqiad.wmnet with reason: REIMAGE [14:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:59] (03CR) 10Marostegui: [C: 03+2] production-m5.sql: Add codfw proxy user [puppet] - 10https://gerrit.wikimedia.org/r/710267 (https://phabricator.wikimedia.org/T288093) (owner: 10Marostegui) [14:17:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:17:24] (03PS1) 10Bartosz Dziewoński: Update preferences language for source mode toolbar [extensions/DiscussionTools] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710096 (https://phabricator.wikimedia.org/T287315) [14:17:33] (03PS1) 10Bartosz Dziewoński: Update preferences language for source mode toolbar [extensions/DiscussionTools] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710097 (https://phabricator.wikimedia.org/T287315) [14:17:41] (03PS1) 10Bartosz Dziewoński: Change 'sourcemodetoolbar' default to enabled (when available) [extensions/DiscussionTools] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710098 (https://phabricator.wikimedia.org/T287927) [14:17:49] (03PS1) 10Bartosz Dziewoński: Change 'sourcemodetoolbar' default to enabled (when available) [extensions/DiscussionTools] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710099 (https://phabricator.wikimedia.org/T287927) [14:18:17] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:18:27] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on maps1010.eqiad.wmnet with reason: REIMAGE [14:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:52] (03PS2) 10Bartosz Dziewoński: DiscussionTools: Make 'sourcemodetoolbar' available everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710075 (https://phabricator.wikimedia.org/T287927) [14:22:32] (03PS1) 10Marostegui: dbproxy2004: Fix port [puppet] - 10https://gerrit.wikimedia.org/r/710268 (https://phabricator.wikimedia.org/T288093) [14:23:06] (03CR) 10Marostegui: [C: 03+2] dbproxy2004: Fix port [puppet] - 10https://gerrit.wikimedia.org/r/710268 (https://phabricator.wikimedia.org/T288093) (owner: 10Marostegui) [14:23:21] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::ops: Scrape metrics from dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/709704 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [14:24:23] PROBLEM - Check systemd state on maps1010 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.6: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh5002.wikimedia.org [14:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:03] PROBLEM - Check the NTP synchronisation status of timesyncd on maps1010 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.6: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [14:26:35] (03PS1) 10Dzahn: acme_chief: allow doh5002 to request wikidough certs [puppet] - 10https://gerrit.wikimedia.org/r/710269 (https://phabricator.wikimedia.org/T284246) [14:27:14] re: maps1010 alerts - the "could not connect" part tells us it's NRPE (nagios-nrpe-server) on the host itself 
[14:27:19] PROBLEM - cassandra CQL 10.64.48.6:9042 on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:27:19] PROBLEM - Check whether ferm is active by checking the default input chain on maps1010 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.6: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:27:30] it often gets killed first by the OOM killer [14:29:45] PROBLEM - Host maps1010 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:17] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on maps1010.eqiad.wmnet with reason: Reimaging [14:30:19] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on maps1010.eqiad.wmnet with reason: Reimaging [14:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:37] but in this case it's this, so my comment did not apply :) [14:30:45] RECOVERY - Host maps1010 is UP: PING OK - Packet loss = 0%, RTA = 2.28 ms [14:30:53] mutante: yeah, seems the downtime failed when the reimage script ran [14:31:38] hnowlan: ACK, I had that issue as well yesterday with some hosts [14:31:47] it's a race [14:32:03] sometimes it tries to send the downtime right before it's back in puppetdb or so [14:32:12] then it can't find the host but moments later it can [14:32:13] (03CR) 10DLynch: [C: 03+1] Update preferences language for source mode toolbar [extensions/DiscussionTools] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710097 (https://phabricator.wikimedia.org/T287315) (owner: 10Bartosz Dziewoński) [14:32:36] (03CR) 10DLynch: [C: 03+1] Update preferences language for source mode toolbar [extensions/DiscussionTools] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710096 (https://phabricator.wikimedia.org/T287315) (owner: 10Bartosz
Dziewoński) [14:32:53] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:34:11] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:34:18] (03PS1) 10Dzahn: DHCP: add MAC address of doh5002 [puppet] - 10https://gerrit.wikimedia.org/r/710271 (https://phabricator.wikimedia.org/T284246) [14:34:43] mutante: <3 [14:34:46] (03PS2) 10Dzahn: DHCP: add MAC address of doh5002 [puppet] - 10https://gerrit.wikimedia.org/r/710271 (https://phabricator.wikimedia.org/T284246) [14:34:57] (03PS5) 10David Caro: prometheus.icinga_exporter: Add label_teams_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/709053 [14:34:59] (03PS7) 10David Caro: profile.icinga_exporter: Added label_teams_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/709054 [14:35:01] (03PS5) 10David Caro: prometheus: added some wmcs team label configs and default sre [puppet] - 10https://gerrit.wikimedia.org/r/709471 [14:35:42] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC address of doh5002 [puppet] - 10https://gerrit.wikimedia.org/r/710271 (https://phabricator.wikimedia.org/T284246) (owner: 10Dzahn) [14:35:45] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator is disabled on buster hosts https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:45] ACKNOWLEDGEMENT - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator is disabled on buster hosts https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:46] cheers sukhe:) brb [14:39:22] (03PS1) 10Elukey: Release upstream version 3.6.3 [debs/helm3] - 10https://gerrit.wikimedia.org/r/710273 [14:40:12] (03CR) 
10Elukey: [V: 03+2 C: 03+2] Release upstream version 3.6.3 [debs/helm3] - 10https://gerrit.wikimedia.org/r/710273 (owner: 10Elukey) [14:41:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:42:25] mutante: o/ releasing your helm3 control change as well :) [14:42:57] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:46:53] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:49:08] (03PS1) 10David Caro: am: Fix typo in parameter name. [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710275 [14:50:11] !log upload helm 3.6.3-1 to {buster,stretch}-wikimedia [14:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:33] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:51:05] (03CR) 10Filippo Giunchedi: [C: 03+2] am: Fix typo in parameter name. [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710275 (owner: 10David Caro) [14:51:26] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: Fix typo in parameter name. 
[debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/710275 (owner: 10David Caro) [14:52:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) [14:52:38] !log depool lvs2008 - T286881 [14:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:46] T286881: Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 [14:53:52] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2008.codfw.wmnet with reason: T286881 [14:53:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2008.codfw.wmnet with reason: T286881 [14:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:06] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: T286881 ` lvs2008.codfw.wmnet ` [14:54:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) @Dzahn that can be done but it will delay the last 3 mw servers until we can get time to remove the decom servers and then unracking and de-cabling the 3... 
[14:55:19] (03Abandoned) 10Urbanecm: Update train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702248 (owner: 10TrainBranchBot) [14:55:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) [14:55:23] (03Abandoned) 10Urbanecm: Update train-versions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702247 (owner: 10TrainBranchBot) [14:55:44] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:06] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:57:08] ^^ that's triggered by the depool of lvs2008 [14:57:32] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [14:59:24] RECOVERY - Check whether ferm is active by checking the default input chain on maps1010 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:59:24] RECOVERY - Check the NTP synchronisation status of timesyncd on maps1010 is OK: OK: synced at Thu 2021-08-05 14:59:23 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [15:02:52] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: Cleanup image tags under docker [deployment-charts] - 10https://gerrit.wikimedia.org/r/708975 (https://phabricator.wikimedia.org/T287374) (owner: 10DCausse) [15:03:25] (03PS1) 10Ema: pontoon: initialize new stack traffic [puppet] - 10https://gerrit.wikimedia.org/r/710279 [15:03:27] (03PS1) 10Ema: pontoon: add cptext and cpupload [puppet] - 10https://gerrit.wikimedia.org/r/710280 [15:03:29] (03PS1) 10Ema: pontoon: add hiera settings for traffic stack [puppet] - 10https://gerrit.wikimedia.org/r/710281 [15:05:29] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10Volans) To replicate what I had found in the task description ([[ https://wiki.postgresql.org/wiki/Disk_Usage | source ]]) here is the same data... [15:05:31] (03Merged) 10jenkins-bot: rdf-streaming-updater: Cleanup image tags under docker [deployment-charts] - 10https://gerrit.wikimedia.org/r/708975 (https://phabricator.wikimedia.org/T287374) (owner: 10DCausse) [15:05:58] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01808 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [15:09:07] (03CR) 10JMeybohm: [C: 03+2] prometheus::ops: Scrape metrics from dfdaemon [puppet] - 10https://gerrit.wikimedia.org/r/709704 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [15:10:07] the thanos compact thing was a block with invalid checksum :( don't have the bandwidth to look into it now though but not urgent [15:10:14] !log pool lvs2008 - T286881 [15:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:21] T286881: Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 [15:10:52] RECOVERY - BGP status on cr1-codfw is OK: BGP 
OK - up: 72, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:11:07] !log depool lvs2009 - T286881 [15:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:16] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2009.codfw.wmnet with reason: T286881 [15:11:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2009.codfw.wmnet with reason: T286881 [15:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:30] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: T286881 ` lvs2009.codfw.wmnet ` [15:11:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:11:35] (03CR) 10Herron: [C: 03+1] sre: add alerting cluster puppet fail [alerts] - 10https://gerrit.wikimedia.org/r/710248 (https://phabricator.wikimedia.org/T283151) (owner: 10Filippo Giunchedi) [15:11:54] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 101, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:12:06] (03PS1) 10Hnowlan: maps: reenable tilerator in imposm cluster [puppet] - 10https://gerrit.wikimedia.org/r/710286 (https://phabricator.wikimedia.org/T269582) [15:12:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:12:58] 
(03CR) 10MSantos: [C: 03+1] maps: reenable tilerator in imposm cluster [puppet] - 10https://gerrit.wikimedia.org/r/710286 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [15:13:13] (03CR) 10Hnowlan: [C: 03+2] maps: reenable tilerator in imposm cluster [puppet] - 10https://gerrit.wikimedia.org/r/710286 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [15:15:24] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:16:26] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:16:36] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:16:56] BGP @ codfw is the depool of lvs2009 [15:18:04] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:19:19] (03PS1) 10Filippo Giunchedi: alertmanager: route Icinga alerts before team routes [puppet] - 10https://gerrit.wikimedia.org/r/710287 [15:19:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:20:02] PROBLEM - SSH on mw1305.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:21:28] RECOVERY - Check systemd state on maps2007 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:12] RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:23:38] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:35] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) I've moved netbox details (console and ethernet connection, IP addressing) from old device to the replacement device now, reflecting t... [15:25:50] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: added some wmcs team label configs and default sre [puppet] - 10https://gerrit.wikimedia.org/r/709471 (owner: 10David Caro) [15:26:16] (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: initialize new stack traffic [puppet] - 10https://gerrit.wikimedia.org/r/710279 (owner: 10Ema) [15:26:25] (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: add cptext and cpupload [puppet] - 10https://gerrit.wikimedia.org/r/710280 (owner: 10Ema) [15:26:34] hashar: wow I went to a meeting and you already merged my changes, thanks <3 [15:26:48] elukey: yeah sometime I am quick enough! 
[15:27:09] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: Deploy imposm to maps2006 [15:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:29] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: Deploy imposm to maps2006 (duration: 00m 20s) [15:27:31] hashar: basically all the time, but this one was particularly fast :D [15:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:56] (03PS2) 10Herron: logstash: add logstash203[345] to codfw elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709732 (https://phabricator.wikimedia.org/T287938) [15:28:13] elukey: I should probably promote a few SRE to act on those changes [15:28:20] operations/dns or operations/puppet also use custom images [15:28:27] but well .. eventually ;) [15:28:37] !log pool lvs2009 - T286881 [15:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:44] T286881: Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 [15:28:56] (03CR) 10Herron: [C: 03+2] logstash: add logstash203[345] to codfw elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709732 (https://phabricator.wikimedia.org/T287938) (owner: 10Herron) [15:29:12] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 101, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:29:23] hashar: I am going to rollout helm3 slowly in all nodes listed in https://debmonitor.wikimedia.org/packages/helm3, there is also contintXXXX, lemme know if it is ok or not [15:29:42] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 72, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:30:05] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) [15:34:09] 10Puppet, 
10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10Volans) I've also found this [[ https://tickets.puppetlabs.com/browse/PDB-4830 | PuppetDB issue ]] that might be related, even if not per-se dir... [15:38:11] (03PS1) 10PipelineBot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/710290 [15:38:17] (03PS2) 10Elukey: Improve the kubeflow-kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/710226 (https://phabricator.wikimedia.org/T272919) [15:41:54] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10Volans) I think that we should definitely filter the more spamming facts outlined above, that should reduce the size of the table and complexity... [15:42:02] RECOVERY - cassandra CQL 10.64.48.6:9042 on maps1010 is OK: TCP OK - 0.000 second response time on 10.64.48.6 port 9042 https://phabricator.wikimedia.org/T93886 [15:42:38] (03PS2) 10Ssingh: wikidough: refactor and move the landing page to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/710262 [15:45:17] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/710290 (owner: 10PipelineBot) [15:45:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:48:19] (03CR) 10Elukey: [C: 04-1] Improve the kubeflow-kfserving chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/710226 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [15:48:36] (03Merged) 10jenkins-bot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/710290 (owner: 10PipelineBot) [15:48:38] (03PS3) 10Ssingh: wikidough: refactor and move the landing page to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/710262 [15:48:54] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Elitre) Just noting that the newly made page was pretty much "orphan" - most of the docs re: Maps live on mw.org, so I wen... [15:49:14] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:50:34] (03PS3) 10Elukey: Improve the kubeflow-kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/710226 (https://phabricator.wikimedia.org/T272919) [15:52:32] !log upgrade helm3 to 3.6.3-1 on deploy1002 [15:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:40] (03PS1) 10Ahmon Dancy: Add 3 additional packages to php7.2-cli [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710294 [15:54:59] !log rolling restart codfw logstash elasticsearch cluster for java updates [15:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:57:05] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0ea1846] (imposm): tegola: mirror 5% of requests everywhere [15:57:08] !log mbsantos@deploy1002 deploy aborted: tegola: mirror 5% of requests everywhere (duration: 00m 03s) [15:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:58:44] (03CR) 10Elukey: [C: 03+2] Improve the kubeflow-kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/710226 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [15:58:54] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0ea1846]: tegola: mirror 5% of requests everywhere [15:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:16] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0ea1846]: tegola: mirror 5% of requests everywhere (duration: 00m 22s) [15:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:30] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [15:59:30] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0ea1846]: maps2008: tegola: mirror 5% of requests everywhere [15:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:51] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0ea1846]: maps2008: tegola: mirror 5% of requests everywhere (duration: 00m 21s) [15:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0ea1846]: maps2009: tegola: mirror 5% of requests everywhere [16:00:04] jbond42 and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210805T1600). 
[16:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:46] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:00:54] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:00:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:00:58] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0ea1846]: maps2009: tegola: mirror 5% of requests everywhere (duration: 00m 55s) [16:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:24] (03CR) 10Ahmon Dancy: "Joe, this moves the package adds made by you in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/700843/2/.pipeline/config.y" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710294 (owner: 10Ahmon Dancy) [16:01:55] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0ea1846]: maps2010: tegola: mirror 5% of requests everywhere [16:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:02] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1006.eqiad.wmnet [16:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:16] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0ea1846]: maps2010: tegola: mirror 5% of requests everywhere (duration: 00m 21s) [16:02:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:02:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[16:02:22] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1006.eqiad.wmnet with reason: Rebuilding as buster replica of maps1009 [16:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:24] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1006.eqiad.wmnet with reason: Rebuilding as buster replica of maps1009 [16:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:53] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Add local hostname to dfdaemon certificate [puppet] - 10https://gerrit.wikimedia.org/r/710295 (https://phabricator.wikimedia.org/T286054) [16:03:49] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0ea1846]: maps2006: tegola: mirror 5% of requests everywhere [16:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:04:09] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30493/console" [puppet] - 10https://gerrit.wikimedia.org/r/710295 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [16:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:13] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0ea1846]: maps2006: tegola: mirror 5% of requests everywhere (duration: 00m 24s) [16:04:16] !log draining maps1006 from maps cassandra cluster [16:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:07] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@16dbc04]: maps2010: imposm: add codfw targets [16:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:30] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@16dbc04]: maps2010: imposm: add codfw targets (duration: 00m 22s) [16:09:32] (03PS2) 10Dzahn: acme_chief: allow doh5002 to request wikidough certs [puppet] - 10https://gerrit.wikimedia.org/r/710269 (https://phabricator.wikimedia.org/T284246) [16:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:44] (03PS1) 10Elukey: helmfile: allow kubeflow-kfserving to create the kfserving-system ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/710296 (https://phabricator.wikimedia.org/T272919) [16:10:02] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@16dbc04]: maps2009: imposm: add codfw targets [16:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:23] (03CR) 10Ssingh: [C: 03+2] acme_chief: allow doh5002 to request wikidough certs [puppet] - 10https://gerrit.wikimedia.org/r/710269 (https://phabricator.wikimedia.org/T284246) (owner: 10Dzahn) [16:10:31] !log mbsantos@deploy1002 Finished 
deploy [tilerator/deploy@16dbc04]: maps2009: imposm: add codfw targets (duration: 00m 29s) [16:10:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:59] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@16dbc04]: maps2008: imposm: add codfw targets [16:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:22] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@16dbc04]: maps2008: imposm: add codfw targets (duration: 00m 23s) [16:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:39] (03CR) 10Elukey: [C: 03+2] helmfile: allow kubeflow-kfserving to create the kfserving-system ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/710296 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [16:12:52] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@16dbc04]: maps2007: imposm: add codfw targets [16:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:17] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@16dbc04]: maps2007: imposm: add codfw targets (duration: 00m 25s) [16:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:08] (03PS1) 10Ladsgroup: Enable shellbox for constraints for 1% of wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710297 (https://phabricator.wikimedia.org/T176312) [16:14:19] (03PS1) 10Dzahn: site: add doh5002 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/710298 (https://phabricator.wikimedia.org/T284246) [16:14:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[16:14:46] !log mbsantos@deploy1002 Started deploy [tilerator/deploy@16dbc04]: maps2006: imposm: add codfw targets [16:14:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:53] (03PS2) 10Dzahn: site: add doh5002 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/710298 (https://phabricator.wikimedia.org/T284246) [16:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:08] !log mbsantos@deploy1002 Finished deploy [tilerator/deploy@16dbc04]: maps2006: imposm: add codfw targets (duration: 00m 22s) [16:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:51] (03CR) 10Dzahn: [C: 03+2] site: add doh5002 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/710298 (https://phabricator.wikimedia.org/T284246) (owner: 10Dzahn) [16:16:01] (03PS4) 10Ssingh: wikidough: refactor and move the landing page to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/710262 [16:16:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:16:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[16:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:49] (03CR) 10Dzahn: [C: 03+1] "lgtm, afaict" [puppet] - 10https://gerrit.wikimedia.org/r/710262 (owner: 10Ssingh) [16:20:44] RECOVERY - SSH on mw1305.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:20:52] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] dragonfly::dfdaemon: Add local hostname to dfdaemon certificate [puppet] - 10https://gerrit.wikimedia.org/r/710295 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [16:20:56] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (10Papaul) [16:21:09] (03CR) 10Ottomata: "These ports will rarely be used; but I did use them when I was originally developing and performance testing this chart." [deployment-charts] - 10https://gerrit.wikimedia.org/r/710111 (https://phabricator.wikimedia.org/T255871) (owner: 10Ottomata) [16:21:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:21:42] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable shellbox for constraints for 1% of wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710297 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [16:27:38] (03PS1) 10Elukey: Kubeflow: fix secrete in chart and update helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/710301 (https://phabricator.wikimedia.org/T272919) [16:28:27] (03CR) 10Brennen Bearnes: "Just noticed John's on vacation for a bit...." 
[puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [16:31:02] (03PS2) 10Elukey: Kubeflow: fix secret in chart and update helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/710301 (https://phabricator.wikimedia.org/T272919) [16:33:45] (03PS1) 10Zabe: Enable GeoData on zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710303 (https://phabricator.wikimedia.org/T287807) [16:34:50] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10Volans) I did some additional investigation on the edges table and so far I found this: - did query a random host for this API: `curl -vvo test... [16:34:58] RECOVERY - Host cloudvirt1038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.94 ms [16:35:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:35:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (10Cmjohnson) I will be able to take a look at this later today or first thing tomorrow. I briefly looked yesterday but it's one of t... [16:36:05] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:14] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) We are close to moving these to A7 now. Several MW's have been decom'd and John and I need to get them out of the rack. Looking to have this done... 
[16:37:46] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:38:50] (03CR) 10Elukey: [C: 03+2] Kubeflow: fix secret in chart and update helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/710301 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [16:39:14] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [16:40:04] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:42:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:42:22] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:04] (03CR) 10Addshore: [C: 03+1] "woo!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/710297 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [16:45:06] (03PS2) 10Ladsgroup: Enable shellbox for constraints for 1% of wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710297 (https://phabricator.wikimedia.org/T176312) [16:45:17] (03CR) 10Ladsgroup: [C: 03+2] "Deploying \o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710297 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [16:45:46] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10Dzahn) I deleted the reserved IP mentioned above and then could run the cookbook again. VM has been created now, has been added to DHCP and OS installed.... [16:46:03] (03Merged) 10jenkins-bot: Enable shellbox for constraints for 1% of wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710297 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [16:46:34] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Ottomata) Thank you! 
[16:47:03] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10Dzahn) 05Open→03Resolved [16:47:34] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1010.eqiad.wmnet [16:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:01] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710297|Enable shellbox for constraints for 1% of wikidata (T176312)]] (duration: 01m 27s) [16:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:11] T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions) - https://phabricator.wikimedia.org/T176312 [16:48:21] Amir1: what dashboard should i watch? [16:48:51] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2006.codfw.wmnet [16:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:10] oooh https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-constraints&var-release=main [16:49:15] well the requests certainly went up :P [16:49:44] addshore: https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-constraints&var-release=main [16:49:49] opps [16:49:50] sukhe: you should be able to ssh to doh5002 now. it has "insetup" role and first puppet run finished. you can apply the actual role whenever [16:49:58] https://grafana.wikimedia.org/d/000000344/wikidata-quality?orgId=1&from=now-7d&to=now [16:50:04] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10ssingh) Thanks very much for the help, @Dzahn! [16:50:08] mutante: thanks! 
[16:50:14] addshore, Amir1: I’m also looking at https://grafana.wikimedia.org/d/000000344/wikidata-quality?viewPanel=7&orgId=1&from=now-3h&to=now&refresh=30s, Q21502404_FormatChecker (not to be confused with the *other* FormatChecker line) [16:50:19] sukhe: you're welcome [16:50:21] also I appreciate the reminder text to not use @mutante because that's my impulse :) [16:50:34] (on phabricator!) [16:50:35] though that’s probably too noisy for us to be able to see any difference at a 1% rate [16:50:49] Amir1: do we have space to go higher than 1% in this slot? [16:50:49] haha yea, once I created that by accident and it was not trivial to delete a user [16:50:51] jouncebot: now [16:50:51] For the next 0 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210805T1600) [16:50:51] 10SRE, 10Fundraising-Backlog, 10Traffic, 10fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10DStrine) @JBennett @BBlack @Dwisehaupt @Jgreen I'm hearing that the email service provider (now branded acoustic) is getting higher ratings. What... [16:50:55] so that was the next best thing [16:50:55] jouncebot: Nemo_bis [16:51:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:51:11] bah, sorry.... no auto complete in here ... not for jouncebot commands anyway ... [16:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:16] jouncebot next [16:51:16] In 0 hour(s) and 8 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210805T1700) [16:51:40] addshore: we definitely can but if legoktm think it's fine. 
I want to make sure no errors showup in logs [16:51:45] +1 [16:51:53] mutante: :D [16:52:05] maybe waiting for a ten minutes or half an hour [16:52:16] specially since tomorrow I will be probably gone [16:52:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:19] Amir1: grafana seems to indicate that at least some requests returned a 500? [16:54:50] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:55:01] yeah, I think SPARQL also responds with 500 sometimes too [16:55:10] but the ratio shouldn't be bad enough [16:55:13] I don't know if it is [16:55:34] yeah, but I wonder what caused the 500, are the shellbox logs visible somewhere? [16:55:36] it can also be deploying, termination of the pod, etc. [16:56:06] yeah, I need to check [17:00:05] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210805T1700) [17:01:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (10jcrespo) I am shutting it down and downtiming it until Monday just in case. 
[17:01:41] (03CR) 10Ssingh: wikidough: refactor and move the landing page to a separate file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710262 (owner: 10Ssingh) [17:04:12] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01662 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [17:07:49] addshore: I found some stuff with shellbox but it's all empty messages https://logstash.wikimedia.org/goto/cb2833718969c6000db61714e9a425a8 [17:08:26] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:08:27] The `log` value says something at least [17:09:06] (03PS1) 10Elukey: kubeflow-kfserving: better handling of Secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/710307 (https://phabricator.wikimedia.org/T272919) [17:09:38] `proxy:fcgi://127.0.0.1:9000/500 597 GET` does that mean that `/metrics` is getting 500s? [17:10:49] I see the odd 500 with `proxy:fcgi://127.0.0.1:9000/500 1197 POST http://localhost:6025/call/constraint-regex-checker` too, but nothing in the logs to help :) [17:11:52] (03CR) 10Elukey: [C: 03+2] kubeflow-kfserving: better handling of Secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/710307 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [17:13:08] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:44] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[17:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:15] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (10Papaul) [17:21:42] addshore: do you want to debug or you want to increase the % to a higher one first? [17:21:54] I think we should just go higher a bit :) [17:22:06] okay, let's go with 10%? [17:22:15] 5 or 10, you decide :) [17:22:30] five sounds better :D [17:22:52] (03PS3) 10Herron: logstash: add logstash103[345] to eqiad elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709731 (https://phabricator.wikimedia.org/T287938) [17:23:37] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@9872df9]: pyspark generalization gerrit:709837 and 666774 [17:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:58] (03CR) 10Herron: [C: 03+2] logstash: add logstash103[345] to eqiad elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709731 (https://phabricator.wikimedia.org/T287938) (owner: 10Herron) [17:25:08] !log end of pdf rebuild on commonswiki (T275268) [17:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:15] T275268: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268 [17:25:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [17:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:19] Amir1: Wow. Congratulations. [17:26:32] \o/ [17:26:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:26:46] Now to fix DjVu files too, right? 
;-) [17:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:51] One fifth of its original size, the half of it is now djvu mess [17:26:59] :-( [17:27:19] (03PS1) 10Phuedx: wikimediaEvents: Enable IP address copy action instrument on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710313 (https://phabricator.wikimedia.org/T279540) [17:27:40] I have looked around to find a php library that reads djvu metadata but I couldn't find one :((( [17:27:51] We probably need to write from scratch I think [17:29:22] PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1241687 MB (15% inode=78%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [17:32:38] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@9872df9]: pyspark generalization gerrit:709837 and 666774 (duration: 09m 01s) [17:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:12] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:36:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:36:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[17:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:05] (03PS1) 10Ladsgroup: Increase the shellbox ratio to 5% for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710315 (https://phabricator.wikimedia.org/T176312) [17:41:31] !log restart airflow-{scheduler|webserver} on an-airflow1001 to pickup deployed plugin changes [17:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:05] !log rolling restart eqiad logstash cluster for java updates [17:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:57] !log upgrade helm3 to 3.6.3-1 on release*, contint*, chartmuseum*, deploy2002 (1002 was already done before) [17:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:08] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey) [17:46:08] (03PS2) 10Ladsgroup: Increase the shellbox ratio to 5% for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710315 (https://phabricator.wikimedia.org/T176312) [17:46:13] (03CR) 10Ladsgroup: [C: 03+2] Increase the shellbox ratio to 5% for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710315 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [17:46:57] (03Merged) 10jenkins-bot: Increase the shellbox ratio to 5% for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710315 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [17:49:46] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710315|Increase the shellbox ratio to 5% for wikidata (T176312)]] (duration: 01m 15s) [17:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:53] T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions) - 
https://phabricator.wikimedia.org/T176312 [17:51:27] Amir1: nom nom nom [17:52:13] looks like more would be fine too [17:52:56] :D [17:53:01] will go up soon [17:54:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:24] ACKNOWLEDGEMENT - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1205869 MB (15% inode=78%): andrew bogott brooke is on top of this https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [17:54:39] Amir1: 15 or 12? ;) [17:54:47] 42% [17:55:04] that's the answer [17:55:05] heh, i didnt mean to say 12 ... [17:55:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:19] Amir1: 21 then 42? [17:56:29] okay :D [17:57:50] (03PS1) 10Ladsgroup: Increase the ratio for shellbox for constraints to 21% in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710318 (https://phabricator.wikimedia.org/T176312) [17:58:10] addshore: can we get this over time for format checker? https://grafana.wikimedia.org/d/000000344/wikidata-quality?viewPanel=1&orgId=1&from=now-7d&to=now [17:58:28] (03CR) 10jerkins-bot: [V: 04-1] Increase the ratio for shellbox for constraints to 21% in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710318 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [17:58:32] I do no know! 
[17:59:30] as the % increases https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-job=constraintsRunCheck&from=now-1h&to=now will start being interesting to look at too [18:00:05] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210805T1800). [18:00:05] MatmaRex and phuedx: A patch you scheduled for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:17] oh nice [18:00:20] yeah [18:00:35] o/ [18:00:35] let's wait for BACC to finish, I'll do it right after [18:00:46] please ping me once you're done! [18:00:49] Amir1: wanna lead window? :D [18:00:51] or should i? [18:00:56] hi [18:00:58] urbanecm: please go ahead [18:01:35] (03CR) 10Urbanecm: [C: 03+2] Change 'sourcemodetoolbar' default to enabled (when available) [extensions/DiscussionTools] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710098 (https://phabricator.wikimedia.org/T287927) (owner: 10Bartosz Dziewoński) [18:01:40] (03CR) 10Urbanecm: [C: 03+2] Change 'sourcemodetoolbar' default to enabled (when available) [extensions/DiscussionTools] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710099 (https://phabricator.wikimedia.org/T287927) (owner: 10Bartosz Dziewoński) [18:02:06] (03PS2) 10Ladsgroup: Increase the ratio for shellbox for constraints to 21% in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710318 (https://phabricator.wikimedia.org/T176312) [18:05:05] addshore: oh the 500 can be the timeout [18:05:29] MatmaRex: is the i18n change _really really_ necessary? 
[18:05:40] backporting i18n changes is...very time expensive, and we don't like to do it 🙂 [18:06:14] uhh [18:06:19] (03PS2) 10Urbanecm: wikimediaEvents: Enable IP address copy action instrument on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710313 (https://phabricator.wikimedia.org/T279540) (owner: 10Phuedx) [18:06:51] (it needs an i18n cache rebuild, which is what takes time) [18:06:52] urbanecm: can we merge it, and let the train run later today update the localisation cache? [18:07:09] if not, then i guess i can just override the messages on-wiki [18:07:23] that won't work -- train promotions are only config changes [18:07:40] the only time train conductors run full scap is when they push train to the test wikis shortly after branching :) [18:07:51] (03Merged) 10jenkins-bot: Change 'sourcemodetoolbar' default to enabled (when available) [extensions/DiscussionTools] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710098 (https://phabricator.wikimedia.org/T287927) (owner: 10Bartosz Dziewoński) [18:07:55] (03Merged) 10jenkins-bot: Change 'sourcemodetoolbar' default to enabled (when available) [extensions/DiscussionTools] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710099 (https://phabricator.wikimedia.org/T287927) (owner: 10Bartosz Dziewoński) [18:08:03] if it's just few wikis, doing it on wiki is probably easier. [18:08:22] if it's all wikis and you need it to happen today for some reason, i can run full scap, i'd just like to know why it's urgent [18:09:17] If it's only changing messages in a few locales... 
it shouldn't take too long [18:09:17] it's technically all wikis, but it probably doesn't matter much for most of them [18:09:19] in theory :D [18:09:33] we wanted to do it because we like to babysit enwiki, i guess [18:09:34] (03CR) 10Urbanecm: [C: 03+2] wikimediaEvents: Enable IP address copy action instrument on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710313 (https://phabricator.wikimedia.org/T279540) (owner: 10Phuedx) [18:09:36] addshore: https://grafana-rw.wikimedia.org/d/000000378/ladsgroup-test?viewPanel=10&orgId=1&from=now-3h&to=now [18:09:45] it's moving average of half an hour [18:09:50] but i can just create the override there [18:10:19] so, i'm fine with skipping that patch [18:10:25] (03Merged) 10jenkins-bot: wikimediaEvents: Enable IP address copy action instrument on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710313 (https://phabricator.wikimedia.org/T279540) (owner: 10Phuedx) [18:10:29] I'd be happier with that tbh [18:10:35] I wonder if we could instrument scap sync-world. Who doesn't love a dashboard? [18:10:48] instrument in what sense? [18:11:11] MatmaRex: your other backport is at mwdebug2001 for tests [18:11:15] phuedx: your config change is there too [18:11:24] phuedx: is suggesting we make a nice melody play whenever someone scaps, right?! ;) [18:11:38] Amir1: one of them still sys 1d? no? [18:11:38] something like hatnote? 
:)) [18:11:38] ottomata: I AM NOW [18:11:59] yeah, that's 99 percentile because it JUMPS [18:12:21] that's why it's on the right axis [18:12:26] (03CR) 10Urbanecm: [C: 03+2] DiscussionTools: Make 'sourcemodetoolbar' available everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710075 (https://phabricator.wikimedia.org/T287927) (owner: 10Bartosz Dziewoński) [18:12:31] aaah, different axis ;) [18:13:03] urbanecm: Instrument in the sense that we send timings of the various scap procedures to statsd [18:13:12] Also, make it play a song [18:13:13] urbanecm: looks good [18:13:17] addshore: ahh, I forgot in UK right is left and left is wrong [18:13:19] thanks MatmaRex, syncing [18:13:22] Highest priority last [18:13:27] :D [18:13:32] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:13:43] I'm pretty certain it's i18n cache rebuild taking most of the time :D [18:13:58] (03PS3) 10Urbanecm: DiscussionTools: Make 'sourcemodetoolbar' available everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710075 (https://phabricator.wikimedia.org/T287927) (owner: 10Bartosz Dziewoński) [18:14:02] (03CR) 10Urbanecm: [C: 03+2] DiscussionTools: Make 'sourcemodetoolbar' available everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710075 (https://phabricator.wikimedia.org/T287927) (owner: 10Bartosz Dziewoński) [18:14:16] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:14:34] phuedx: https://wikitech.wikimedia.org/wiki/Prometheus#Ephemeral_jobs_(Pushgateway) [18:15:00] (03Merged) 10jenkins-bot: DiscussionTools: Make 'sourcemodetoolbar' available everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710075 
(https://phabricator.wikimedia.org/T287927) (owner: 10Bartosz Dziewoński) [18:15:14] phuedx: lmk what you think about your config change ;) [18:15:18] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:15:40] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [18:15:40] urbanecm: It's awesome. It LGTM too. Thanks [18:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:52] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/DiscussionTools/extension.json: 38a8658d81f16700accf0df68504a121ddf41ffb: Change sourcemodetoolbar default to enabled when available (T287927) (duration: 01m 06s) [18:15:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:15:58] phuedx: thanks, syncing [18:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:00] T287927: Make config change to expose source mode toolbar by default - https://phabricator.wikimedia.org/T287927 [18:16:12] !log urbanecm@deploy1002 sync-file aborted: 91f7c0233e2573a629e92a4b14c9b4be2b401e2f: Change sourcemodetoolbar default to enabled when available (T287927) (duration: 00m 04s) [18:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:58] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:24] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/DiscussionTools/extension.json: 91f7c0233e2573a629e92a4b14c9b4be2b401e2f: Change sourcemodetoolbar default to enabled when available (T287927) (duration: 01m 06s) [18:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:36] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:18:56] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0a14eb418288ad8ea25c206d20f2bed589de8107: wikimediaEvents: Enable IP address copy action instrument on all wikis (T279540) (duration: 01m 07s) [18:19:01] phuedx: should be live! [18:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:03] T279540: Instrument IP address copy action metric - https://phabricator.wikimedia.org/T279540 [18:19:11] Thanks, urbanecm! [18:19:16] MatmaRex: your config change available at mwdebug2001, please have a look [18:19:16] np phuedx [18:21:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:21:35] urbanecm: looks good! 
[18:21:41] thanks, syncing [18:22:44] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.03003 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [18:23:11] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: da36bc3a05101f56e357969371b91e05660b9560: DiscussionTools: Make sourcemodetoolbar available everywhere (T287927) (duration: 01m 06s) [18:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:18] T287927: Make config change to expose source mode toolbar by default - https://phabricator.wikimedia.org/T287927 [18:23:19] should be live! [18:23:26] so, we're done now i think :) [18:23:57] !log Adding peering to second router of Xiber LLC - AS393950 - on cr2-eqord (Equinix IX Chicago) [18:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:05] Amir1: you can go ahead [18:24:14] thanks [18:24:35] Thanks! 
[18:24:41] (03PS3) 10Ladsgroup: Increase the ratio for shellbox for constraints to 21% in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710318 (https://phabricator.wikimedia.org/T176312) [18:24:43] (03CR) 10Ladsgroup: [C: 03+2] Increase the ratio for shellbox for constraints to 21% in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710318 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [18:25:35] (03Merged) 10jenkins-bot: Increase the ratio for shellbox for constraints to 21% in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710318 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [18:28:09] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710318|Increase the ratio for shellbox for constraints to 21% in Wikidata (T176312)]] (duration: 01m 06s) [18:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:16] T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions) - https://phabricator.wikimedia.org/T176312 [18:28:46] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:30:05] (03CR) 10Giuseppe Lavagetto: "The reason why I didn't include them here is that this image is used for other php-based images. If we want to move these package installa" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710294 (owner: 10Ahmon Dancy) [18:33:40] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:33:42] (03CR) 10Ahmon Dancy: Add 3 additional packages to php7.2-cli (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/710294 (owner: 10Ahmon Dancy) [18:36:58] Amir1: 42? [18:37:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:37:14] on it [18:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:38:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:35] (03PS1) 10Ladsgroup: Increase the ratio for shellbox for constraints to 42% in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710328 (https://phabricator.wikimedia.org/T176312) [18:41:18] (03CR) 10Ladsgroup: [C: 03+2] Increase the ratio for shellbox for constraints to 42% in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710328 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [18:41:28] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:42:08] (03Merged) 10jenkins-bot: Increase the ratio for shellbox for constraints to 42% in Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710328 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [18:44:36] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710328|Increase the ratio for shellbox for constraints to 42% in Wikidata (T176312)]] (duration: 01m 06s) [18:44:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:44] T176312: Don’t check format constraint via SPARQL (safely evaluating user-provided regular expressions) - https://phabricator.wikimedia.org/T176312 [18:46:00] *looks at the dashboards [18:46:41] (03Abandoned) 10Bartosz Dziewoński: Update preferences language for source mode toolbar [extensions/DiscussionTools] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/710096 (https://phabricator.wikimedia.org/T287315) (owner: 10Bartosz Dziewoński) [18:46:45] (03Abandoned) 10Bartosz Dziewoński: Update preferences language for source mode toolbar [extensions/DiscussionTools] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710097 (https://phabricator.wikimedia.org/T287315) (owner: 10Bartosz Dziewoński) [18:47:07] Amir1: it's nice to see sparql throttling for the constraint checks at 0 now [18:47:57] https://grafana-rw.wikimedia.org/d/RKogW1m7z/shellbox?viewPanel=55&orgId=1&from=now-3h&to=now&forceLogin=true&var-dc=codfw%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-constraints&var-release=main [18:48:03] nice to see it all stay in the same bucket so far [18:48:14] yup [18:48:58] although there are regular outliers [18:49:05] https://grafana-rw.wikimedia.org/d/RKogW1m7z/shellbox?viewPanel=56&orgId=1&from=now-3h&to=now&forceLogin=true&var-dc=codfw%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-constraints&var-release=main [18:49:12] 1-3s [18:49:17] here and there throughout [18:49:21] that's normal? [18:49:32] is this actually latency, or request time? [18:50:05] it's measured from envoy from the outside [18:50:09] gotcha [18:50:18] i could imagine that some of the regexes are evil [18:50:31] I assume from the appserver side, but I don't know if there is another envoy as a sidecar measuring this there instead [18:51:33] 1 second seems like a lot though. Is that hitting a timeout? what's the timeout configured as? [18:51:49] what's the longest we want to allow? 
also, are there retries in envoy? [18:52:40] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:53:53] Krinkle: it's 5 seconds [18:54:09] the envoy timeout is set to 10 seconds but app is set to five [18:54:21] Amir1: curl timeout? [18:54:29] yup [18:54:29] in shellbox client in mw? [18:54:31] k [18:55:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:56:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:56:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:58:00] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:58:48] addshore: https://grafana-rw.wikimedia.org/d/000000378/ladsgroup-test?viewPanel=10&orgId=1&from=now-3h&to=now it's now starting to go down [18:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:48] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:04] dduvall and twentyafterfour: (Dis)respected human, time to deploy MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210805T1900). Please do the needful. [19:00:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:00:09] Timing-wise i don't see any indication that this is slower or faster than the old thing, but that's good! [19:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:32] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:11:44] addshore, Amir1: are we clear for train? [19:11:58] i believe so, we are not touching things :) [19:12:06] yeah [19:12:24] k thanks! 
[19:13:37] * dduvall enqueues Crazy Train by Ozzy [19:14:03] "all aboooooard hahaha" [19:14:13] chooo chooo [19:14:19] (03PS1) 10Herron: logstash: extend ssd tier retention from 15 to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/710341 (https://phabricator.wikimedia.org/T287938) [19:14:31] twentyafterfour: o/ rolling momentarily [19:15:45] (03PS1) 10Dduvall: all wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710342 [19:15:47] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710342 (owner: 10Dduvall) [19:16:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:17:01] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710342 (owner: 10Dduvall) [19:18:44] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.17 [19:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:19:54] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:46] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:08] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [19:29:41] DannyS712, Pchelolo: i overlooked https://phabricator.wikimedia.org/T288191 what's the status on that? should i rollback/block? [19:31:54] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.03583 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [19:35:57] (03PS5) 10Dave Pifke: xhgui: enable database access for admins [puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) [19:36:59] bpirkle: i see you cc'd on the fix for T288191. 
please advise ^ i can either rollback train and block or we can backport the fix, but i need guidance [19:36:59] T288191: substituting {{#tag:ref}} tags and templates; and pipe tricks fail in Page namespace - https://phabricator.wikimedia.org/T288191 [19:44:00] (03PS5) 10Ssingh: wikidough: refactor and move the landing page to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/710262 [19:44:08] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:05] (03PS6) 10Dave Pifke: xhgui: enable database access for admins [puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) [19:45:56] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:21] (03PS6) 10Ssingh: wikidough: refactor and move the landing page to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/710262 [19:47:38] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30499/console" [puppet] - 10https://gerrit.wikimedia.org/r/710262 (owner: 10Ssingh) [19:49:03] (03CR) 10Ssingh: [V: 03+1 C: 03+2] wikidough: refactor and move the landing page to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/710262 (owner: 10Ssingh) [19:50:32] (03PS7) 10Dave Pifke: xhgui: enable database access for admins [puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) [19:51:37] (03PS1) 10Clare Ming: Enable user links feature for pilot wikis, modern vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710344 (https://phabricator.wikimedia.org/T288274) [19:52:16] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:53:32] (03CR) 10Dave Pifke: "PCC output: https://puppet-compiler.wmflabs.org/compiler1003/30500/" [puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [19:59:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:02:02] (03PS1) 1020after4: Support deprecated Content::preSaveTransform override [core] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/710102 (https://phabricator.wikimedia.org/T288191) [20:02:08] (03CR) 10Nray: [C: 03+1] Enable user links feature for pilot wikis, modern vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710344 (https://phabricator.wikimedia.org/T288274) (owner: 10Clare Ming) [20:10:18] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:12] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:01] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:44] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:18:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need 
By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Cmjohnson) [20:20:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:20:50] SRE, ops-eqiad, DC-Ops, serviceops, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Cmjohnson) The only remaining on most of these is the idrac setup, This will happen tomorrow (Friday 6 AUG) [20:23:16] (PS1) Papaul: Add ms-be206[2345] to DHCP file and site.pp [puppet] - https://gerrit.wikimedia.org/r/710352 (https://phabricator.wikimedia.org/T285809) [20:23:36] !log 1.37.0-wmf.17 promoted to all wikis. no new errors or concerning rates (T281158). fixes for open UBN T288191 will be handled via backport (see task discussion) [20:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:44] T281158: 1.37.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T281158 [20:23:44] T288191: substituting {{#tag:ref}} tags and templates; and pipe tricks fail in Page namespace - https://phabricator.wikimedia.org/T288191 [20:24:28] (CR) Papaul: [C: +2] Add ms-be206[2345] to DHCP file and site.pp [puppet] - https://gerrit.wikimedia.org/r/710352 (https://phabricator.wikimedia.org/T285809) (owner: Papaul) [20:29:20] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:16] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP)
rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` ms-be2062.codfw.wmnet ` The log can be found in `/var/l... [20:33:14] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [20:33:50] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 12.71 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [20:37:32] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:37:40] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 10.86 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [20:39:02] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.04208 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [20:44:37] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2062.codfw.wmnet with reason: REIMAGE [20:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:16] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 14.38 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [20:48:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook
sre.dns.netbox (exit_code=0) [20:48:03] SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (Cmjohnson) [20:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:37] SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: TBD) rack/setup/install dumpsdata100[45] - https://phabricator.wikimedia.org/T283290 (Cmjohnson) these need idrac setups and should be completed by early next week (week of 9 AUG) [20:48:48] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2062.codfw.wmnet with reason: REIMAGE [20:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:43] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:11] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:38] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 10.58 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [21:02:58] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:11:13] (PS1) Ssingh: Add doh5002 to BGP anycast in eqsin [homer/public] - https://gerrit.wikimedia.org/r/710358 (https://phabricator.wikimedia.org/T283503) [21:12:05] (PS1) Ssingh: site: switch doh5002 to O:wikidough [puppet] - https://gerrit.wikimedia.org/r/710360 [21:13:18] (PS2) Ssingh: Add doh5002 to BGP anycast in eqsin [homer/public] - https://gerrit.wikimedia.org/r/710358 (https://phabricator.wikimedia.org/T283503) [21:16:20] (PS1) Jdlrobson: Add
visualClear style to MonoBook [skins/MonoBook] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710105 (https://phabricator.wikimedia.org/T288288) [21:26:48] (PS2) Jforrester: Add visualClear style to MonoBook [skins/MonoBook] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710105 (https://phabricator.wikimedia.org/T288288) (owner: Jdlrobson) [21:28:07] (CR) Jforrester: [C: +2] Add visualClear style to MonoBook [skins/MonoBook] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710105 (https://phabricator.wikimedia.org/T288288) (owner: Jdlrobson) [21:33:16] (Merged) jenkins-bot: Add visualClear style to MonoBook [skins/MonoBook] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710105 (https://phabricator.wikimedia.org/T288288) (owner: Jdlrobson) [21:34:25] Jdlrobson: Live on mwdebug2001. [21:34:54] LGTM; OK to sync? [21:37:51] (PS1) Papaul: Remove role insetup for ms-be206[2345] since there is already a role for ms-be* nodes [puppet] - https://gerrit.wikimedia.org/r/710362 (https://phabricator.wikimedia.org/T285809) [21:38:23] (CR) jerkins-bot: [V: -1] Remove role insetup for ms-be206[2345] since there is already a role for ms-be* nodes [puppet] - https://gerrit.wikimedia.org/r/710362 (https://phabricator.wikimedia.org/T285809) (owner: Papaul) [21:38:25] James_F: 1s.. [21:39:07] James_F: yep good to go! [21:40:02] Syncing now. [21:40:25] cheers! [21:40:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:22] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.17/skins/MonoBook/resources/screen-common.less: T288288 Restore visualClear style to MonoBook so that footer doesn't show in the interwiki list (duration: 01m 24s) [21:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:29] T288288: 1.37.0-wmf.17 deployment moved footer from bottom of the page to left column in Monobook - https://phabricator.wikimedia.org/T288288 [21:42:00] (PS2) Papaul: Remove role insetup for ms-be206[2345] form site.pp [puppet] - https://gerrit.wikimedia.org/r/710362 (https://phabricator.wikimedia.org/T285809) [21:42:04] Jdlrobson: Should we mark the task as Resolved or does it need to go through Web team processes? [21:42:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:27] (CR) Ryan Kemper: [C: +2] analytics web: create htdocs subdirectory [puppet] - https://gerrit.wikimedia.org/r/709822 (owner: Ryan Kemper) [21:43:49] (PS3) Papaul: Remove role insetup for ms-be206[2345] from site.pp [puppet] - https://gerrit.wikimedia.org/r/710362 (https://phabricator.wikimedia.org/T285809) [21:44:40] (CR) Papaul: [C: +2] Remove role insetup for ms-be206[2345] from site.pp [puppet] - https://gerrit.wikimedia.org/r/710362 (https://phabricator.wikimedia.org/T285809) (owner: Papaul) [21:46:08] (CR) BPirkle: [C: +2] Support deprecated Content::preSaveTransform override [core] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710102 (https://phabricator.wikimedia.org/T288191) (owner: 20after4) [21:47:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0:
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:47:29] bpirkle: Did you mean to merge in the production branch? Are you emergency-deploying now? [21:50:53] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:51:47] RECOVERY - Persistent high iowait on labstore1004 is OK: (C)10 ge (W)5 ge 3.301 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [21:51:55] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:54:29] (CR) BryanDavis: [C: +1] production-m5.sql.erb: Add toolhub grants [puppet] - https://gerrit.wikimedia.org/r/709877 (https://phabricator.wikimedia.org/T271480) (owner: Marostegui) [21:55:36] SRE, serviceops: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (dancy) [21:57:09] SRE, docker-pkg, serviceops: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (dancy) [21:57:35] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:58:19] SRE, docker-pkg, serviceops: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (dancy) [21:59:53] RECOVERY - NFS Share Volume Space /srv/tools on labstore1004 is OK: DISK OK - free space: /srv/tools 1828675 MB (23% inode=78%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [22:02:48] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install
ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2062.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2062.codfw.wmnet'] ` [22:03:10] (Merged) jenkins-bot: Support deprecated Content::preSaveTransform override [core] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710102 (https://phabricator.wikimedia.org/T288191) (owner: 20after4) [22:03:24] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` ms-be2062.codfw.wmnet ` The log can be found in `/var/l... [22:06:03] PROBLEM - Host ms-be2062 is DOWN: PING CRITICAL - Packet loss = 100% [22:07:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:07:41] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [22:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:59] RECOVERY - Host ms-be2062 is UP: PING OK - Packet loss = 0%, RTA = 30.16 ms [22:10:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[22:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:17] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.17/includes/content/ContentHandler.php: T288191: Support deprecated Content::preSaveTransform override (1/2) (duration: 01m 00s) [22:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:24] T288191: substituting {{#tag:ref}} tags and templates; and pipe tricks fail in Page namespace - https://phabricator.wikimedia.org/T288191 [22:11:31] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.009586 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [22:12:26] !log jforrester@deploy1002 Synchronized php-1.37.0-wmf.17/includes/content/: T288191: Support deprecated Content::preSaveTransform override (2/2) (duration: 00m 55s) [22:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:57] (CR) Cwhite: [C: +1] logstash: extend ssd tier retention from 15 to 30 days [puppet] - https://gerrit.wikimedia.org/r/710341 (https://phabricator.wikimedia.org/T287938) (owner: Herron) [22:24:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:25:54] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:27:54] (PS1) Legoktm: Revert "Use CsrfTokenSet as CSRF token source" [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/710368 (https://phabricator.wikimedia.org/T287542) [22:27:56] (CR) Legoktm: [C: +2] Revert "Use CsrfTokenSet as CSRF token source" [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/710368 (https://phabricator.wikimedia.org/T287542) (owner: Legoktm) [22:28:27] (PS1) Legoktm: Revert "Use CsrfTokenSet as CSRF token source" [core] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710369 (https://phabricator.wikimedia.org/T287542) [22:28:29] (CR) Legoktm: [C: +2] Revert "Use CsrfTokenSet as CSRF token source" [core] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710369 (https://phabricator.wikimedia.org/T287542) (owner: Legoktm) [22:30:18] (PS1) Ryan Kemper: analytics-web: temporary rsync module for thorium [puppet] - https://gerrit.wikimedia.org/r/710371 (https://phabricator.wikimedia.org/T285355) [22:31:49] (CR) Razzi: [C: +1] analytics-web: temporary rsync module for thorium [puppet] - https://gerrit.wikimedia.org/r/710371 (https://phabricator.wikimedia.org/T285355) (owner: Ryan Kemper) [22:32:43] (PS2) Ahmon Dancy: New image: php7.2-fpm-multiversion-base [docker-images/production-images] - https://gerrit.wikimedia.org/r/710294 (https://phabricator.wikimedia.org/T285309) [22:33:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:33:28] (CR) Ahmon Dancy: New image: php7.2-fpm-multiversion-base (1 comment) [docker-images/production-images] -
https://gerrit.wikimedia.org/r/710294 (https://phabricator.wikimedia.org/T285309) (owner: Ahmon Dancy) [22:34:20] (CR) Ryan Kemper: [C: +2] analytics-web: temporary rsync module for thorium [puppet] - https://gerrit.wikimedia.org/r/710371 (https://phabricator.wikimedia.org/T285355) (owner: Ryan Kemper) [22:34:41] SRE, docker-pkg, serviceops: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (dancy) [22:38:20] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:38:21] SRE, Services, Wikibase-Quality-Constraints, Wikidata, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Legoktm) [22:39:04] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:40:30] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:40:50] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2062.codfw.wmnet'] ` Of which those **FAILED**: ` ['ms-be2062.codfw.wmnet'] ` [22:41:19] (CR) Cwhite: [C: +1] sre: add alerting cluster puppet fail [alerts] - https://gerrit.wikimedia.org/r/710248 (https://phabricator.wikimedia.org/T283151) (owner: Filippo Giunchedi) [22:41:26] SRE, Services, Wikibase-Quality-Constraints, Wikidata, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Legoktm) Open→Resolved a:Legoktm I'm going to close this as resolved as I believe everythin... [22:41:58] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` ms-be2062.codfw.wmnet ` The log can be found in `/var/l...
[22:42:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:44:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:46:52] (Merged) jenkins-bot: Revert "Use CsrfTokenSet as CSRF token source" [core] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/710368 (https://phabricator.wikimedia.org/T287542) (owner: Legoktm) [22:47:36] (CR) jerkins-bot: [V: -1] Revert "Use CsrfTokenSet as CSRF token source" [core] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710369 (https://phabricator.wikimedia.org/T287542) (owner: Legoktm) [22:49:05] (PS2) Legoktm: Revert "Use CsrfTokenSet as CSRF token source" [core] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710369 (https://phabricator.wikimedia.org/T287542) [22:49:09] (CR) Legoktm: [C: +2] Revert "Use CsrfTokenSet as CSRF token source" [core] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710369 (https://phabricator.wikimedia.org/T287542) (owner: Legoktm) [22:51:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:51:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[22:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:10] heh, okay, apparently you can't sync the entire php- directory anymore [22:52:25] CalledProcessError: Command 'find -O2 '/srv/mediawiki-staging/php-1.37.0-wmf.16/' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P30 -exec php -l >/dev/null 2>&1' returned non-zero exit status 124 [22:52:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:53:36] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.16/includes/: Revert "Use CsrfTokenSet as CSRF token source" (T287542) (duration: 01m 02s) [22:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:08] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [22:57:21] looks spurious ^ [22:57:58] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:58:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2062.codfw.wmnet with reason: REIMAGE [22:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] brennen: Dear deployers, time to do the US Backport and Config trainingYour patch may or may not be deployed at the sole discretion of the deployer deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210805T2300). [23:00:05] EricGardner: A patch you scheduled for US Backport and Config trainingYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2062.codfw.wmnet with reason: REIMAGE [23:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:10] I'm here for the deployment window when ready [23:01:20] brennen, EricGardner: fyi I have one core revert backport going through jenkins right now, I was hoping to have been done by now, sorry [23:01:33] legoktm: no worries, I'll stand by [23:02:30] EricGardner: we're all doing deployment training: were you planning to do the backport? [23:02:42] legoktm: thanks, let us know. here with xSavitar, thcipriani and cjming for deployment training [23:03:14] I can join the call in that case, I just wanted to backport a patch during today's window if there is time [23:07:56] PROBLEM - Host ms-be2062 is DOWN: PING CRITICAL - Packet loss = 100% [23:08:20] RECOVERY - Host ms-be2062 is UP: PING OK - Packet loss = 0%, RTA = 30.05 ms [23:10:49] (Merged) jenkins-bot: Revert "Use CsrfTokenSet as CSRF token source" [core] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710369 (https://phabricator.wikimedia.org/T287542) (owner: Legoktm) [23:15:11] tested on mwdebug2001, syncing now [23:15:26] it's possible there might be spurious errors because of sync order [23:16:05] !log legoktm@deploy1002 Synchronized php-1.37.0-wmf.17/includes/: Revert "Use CsrfTokenSet as CSRF token source" (T287542) (duration: 01m 03s) [23:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:14] (PS1) Eric Gardner: Revert "Open search result links
in-place" [extensions/MediaSearch] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710387 [23:16:15] or not. :D [23:16:50] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:16:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:17:45] EricGardner, brennen: I'm all done, thanks for waiting [23:17:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:18:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:18:44] legoktm: <3 sorry CI slowed you down :) [23:19:13] nah, it actually caught a bug [23:19:28] then I'm glad CI slowed you down [23:19:31] :P [23:20:35] hehe [23:20:41] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2062.codfw.wmnet'] ` and were **ALL** successful. [23:21:48] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` ms-be2063.codfw.wmnet ` The log can be found in `/var/l...
[23:34:19] (CR) Eric Gardner: [C: +2] Revert "Open search result links in-place" [extensions/MediaSearch] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710387 (owner: Eric Gardner) [23:37:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2063.codfw.wmnet with reason: REIMAGE [23:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:20] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be2063.codfw.wmnet with reason: REIMAGE [23:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:54] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.523e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [23:46:24] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.003628 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [23:55:14] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be2063.codfw.wmnet'] ` and were **ALL** successful. [23:56:51] (Merged) jenkins-bot: Revert "Open search result links in-place" [extensions/MediaSearch] (wmf/1.37.0-wmf.17) - https://gerrit.wikimedia.org/r/710387 (owner: Eric Gardner) [23:56:58] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: ASAP) rack/setup/install ms-be20[62-65] - https://phabricator.wikimedia.org/T285809 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` ms-be2064.codfw.wmnet ` The log can be found in `/var/l...
[23:57:39] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down