[00:00:07] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210701T0000). [00:02:00] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:03:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:44] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:09:40] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:24:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:26:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:32:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:34:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, 
dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:34:46] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:36:02] SRE, Traffic: Image load failing with 429 from varnish - https://phabricator.wikimedia.org/T285875 (Legoktm) p:Triage→Unbreak! In general we've made some changes recently to rate limiting after repeated abuse/DDoS attacks. Could you please clarify what software (e.g. Firefox, Chrome, some other t... [00:36:42] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:46:20] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:48:14] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:53:44] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:55:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:57:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:57:24] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:03:55] SRE, Traffic: Image load failing with 429 from varnish - https://phabricator.wikimedia.org/T285875 (RoySmith) > Could you please clarify what software (e.g. Firefox, Chrome, some other tool) you're using to access pages/images that is returning 429s? I'm not sure who that was intended for, but I get sim... [01:10:03] SRE, Traffic: Image load failing with 429 from varnish - https://phabricator.wikimedia.org/T285875 (Legoktm) >>! In T285875#7188943, @RoySmith wrote: >> Could you please clarify what software (e.g. Firefox, Chrome, some other tool) you're using to access pages/images that is returning 429s? > > I'm not... [01:11:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:13:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:15:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:17:48] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:22:51] SRE, Traffic: Image load failing with 429 from varnish - https://phabricator.wikimedia.org/T285875 (RoySmith) If you could generate the URLs for the other size images, I'd be happy to give them a try from here. [01:23:40] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:25:36] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:04:40] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:06:30] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:26:21] SRE, ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (Papaul) @Dzahn @jijiki @Joe I received the disk today, I will be replacing it tomorrow Thursday at 10:00am CT. If you need to do anything on this server before I replace the disk please let me know or you can just de-poo... 
[02:35:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:37:56] SRE, Thumbor: Image load failing with 429 from varnish - https://phabricator.wikimedia.org/T285875 (Legoktm) p:Unbreak!→High @colewhite and I dug into this, it appears to be an issue with Thumbor: {P16748} Looking at https://logstash.wikimedia.org/goto/b74fa1ac65d1c96d08666f798a7f1fad we found... [02:39:26] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:41:45] SRE, serviceops, Patch-For-Review: Delay spinner showing for graphs for 1s - https://phabricator.wikimedia.org/T256641 (Seddon) a:Seddon→None [02:45:13] SRE, Thumbor: Image load failing with 429 from varnish - https://phabricator.wikimedia.org/T285875 (Legoktm) T226318#5282215 suggests that the 429 vs 500 may be a red herring in that thumbor will refuse to re-render a file it failed to render previously given that it's not going to make a difference. 
[02:45:21] SRE, Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (Legoktm) [02:54:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:04:25] PROBLEM - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:04:57] Bleh [03:06:15] RECOVERY - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.189 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:06:35] still looking [03:09:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:15:41] all the errors are like "Error: Could not parse CSS stylesheet" [03:21:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:26:56] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:27:42] the zotero spikes seem normal, I'm not looking anymore unless it pages again [03:29:00] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:32:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:32:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:36:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:37:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:41:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:50:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:52:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:00:04] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:01:55] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:23:58] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:25:48] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:27:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:29:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:29:40] (PS2) ArielGlenn: Cleanup old mediainfo dumps [puppet] - https://gerrit.wikimedia.org/r/702413 (https://phabricator.wikimedia.org/T273266) (owner: Matthias Mullie) [04:35:07] (CR) ArielGlenn: [C: +2] Cleanup old mediainfo dumps [puppet] - https://gerrit.wikimedia.org/r/702413 (https://phabricator.wikimedia.org/T273266) (owner: Matthias Mullie) [04:48:59] !log Disconnect eqiad -> codfw replication from s1-s8 [04:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:59:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:59:34] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:03:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:05:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:11:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:22:41] (CR) Marostegui: [C: +2] mariadb: Set core sections to unidir replication. [puppet] - https://gerrit.wikimedia.org/r/702255 (owner: Marostegui) [05:25:12] (PS1) Marostegui: db1122,db1129,db1104: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/702505 [05:25:56] (CR) Marostegui: [C: +2] db1122,db1129,db1104: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/702505 (owner: Marostegui) [05:27:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P16749 and previous config saved to /var/cache/conftool/dbconfig/20210701-052702-marostegui.json [05:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:30:44] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:40:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:44:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:50:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:51:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:52:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1129', diff saved to https://phabricator.wikimedia.org/P16750 and previous config saved to /var/cache/conftool/dbconfig/20210701-055243-marostegui.json [05:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:52] !log Deploy schema change on s6 eqiad master (db1173) T277123 [05:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:00] T277123: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 [05:57:34] !log Deploy schema change on s5 eqiad master (db1130) T277123 [05:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:49] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP 
: 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:58:53] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:59:03] the flapping of OSPF --^ is related to Lumen IIUC right? [05:59:15] if it keeps going we may need to downtime it, very spammy [06:02:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:03:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:03:43] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:06:23] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:15:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:18:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:23:12] looking [06:24:22] yep, lumen, opening a ticket [06:26:47] NEEDS ATTENTION [06:26:47] Initial service diagnostics have detected that current optical light levels are outside manufacturer recommendations at 957 STATION RD, BELLPORT, NY. 
[06:26:47] Recommended range is Min: -6 dBm; Max: -1 dBm. Measured light level is -55 dBm. [06:26:47] Next Steps: Please open a Repair Ticket for review by a Lumen technician. [06:27:23] reminds me of https://img.ifunny.co/images/fefe8365fc0b2abc23aa87fe276113f1cd7c4fdcc6f430322c3fe76d5704a240_1.jpg [06:29:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:31:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:31:51] !log Deploy schema change on s2,s8 eqiad masters T277123 [06:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:01] T277123: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 [06:34:42] !log Deploy schema change on s7 eqiad (db1136) masters T277123 [06:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:23] (PS1) Marostegui: db1110,db1180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/702578 [06:57:50] (CR) Marostegui: [C: +2] db1110,db1180: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/702578 (owner: Marostegui) [06:59:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:05:07] SRE, serviceops, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Legoktm) a:Legoktm I've tried to summarize a combination of what I did and the feedback here into 
https://wikitech.wikimedia.org/wiki/Switch_Dat... [07:06:03] !log Deploy schema change on s4 eqiad (db1138) master T277123 [07:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:11] T277123: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 [07:07:22] SRE, MW-on-K8s, Shellbox, serviceops, and 3 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (Samwilson) The 1.36 release notes say that "Command::execute() now returns a Shellbox\Command\UnboxedResult instead of a MediaWiki\Shell\Result.... [07:08:17] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:13:44] SRE, serviceops, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Joe) After talking off-phabricator with a few people, I think what we have seen is more of a failure of coordination between affected SRE teams than... [07:14:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:15:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:17:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:18:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:22:47] (PS1) Muehlenhoff: elastic: Switch to nginx-light [puppet] - https://gerrit.wikimedia.org/r/702580 (https://phabricator.wikimedia.org/T164456) [07:25:59] (CR) DCausse: "> Patch Set 22: Code-Review+1" [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles) [07:27:27] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/702580 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff) [07:29:51] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:30:33] (CR) JMeybohm: "This change is ready for review." 
[puppet] - https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T264209) (owner: JMeybohm) [07:32:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:46:05] SRE, Traffic, User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (ayounsi) p:Medium→High [08:04:26] (CR) Filippo Giunchedi: "Code LGTM although please publish/rebase the patch against the upstream-21.4.0 branch which is what is deployed (yes it is confusing, mast" [software/librenms] - https://gerrit.wikimedia.org/r/702438 (https://phabricator.wikimedia.org/T229542) (owner: Cathal Mooney) [08:06:58] (CR) Ayounsi: "Thanks!" (4 comments) [software/librenms] - https://gerrit.wikimedia.org/r/702438 (https://phabricator.wikimedia.org/T229542) (owner: Cathal Mooney) [08:09:26] SRE, Infrastructure-Foundations, SRE-tools: Broken disk on thanos-be1003 but not reported / task not opened - https://phabricator.wikimedia.org/T285662 (fgiunchedi) Thank you for the context, now I also recall a similar failure mode where we were wishing to have the number of expected disks! Indeed I... 
[08:11:26] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw1261.eqiad.wmnet [08:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:07] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1261.eqiad.wmnet [08:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:18] ^ depools are us, working together in a session on how to decom old eqiad hardware [08:19:12] (CR) Muehlenhoff: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/697605 (owner: Ottomata) [08:22:03] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw126[2-6].eqiad.wmnet [08:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:07] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=mw126[2-6].eqiad.wmnet [08:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:42] (CR) Effie Mouzeli: tegola-vector-tiles: add helmfile.d config (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/701138 (https://phabricator.wikimedia.org/T283159) (owner: Effie Mouzeli) [08:27:52] (CR) Effie Mouzeli: tegola-vector-tiles: add caching support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/701369 (owner: Jgiannelos) [08:28:00] (CR) Effie Mouzeli: [C: +2] tegola-vector-tiles: add caching support [deployment-charts] - https://gerrit.wikimedia.org/r/701369 (owner: Jgiannelos) [08:28:33] !log jelto@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1261.eqiad.wmnet [08:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:25] (Merged) jenkins-bot: tegola-vector-tiles: add caching support [deployment-charts] - https://gerrit.wikimedia.org/r/701369 (owner: Jgiannelos) [08:31:03] (CR) Ayounsi: [C: -1] "Indeed! 
I'm also trying to think on how to not "hard-code" interface names." (4 comments) [homer/public] - https://gerrit.wikimedia.org/r/702446 (https://phabricator.wikimedia.org/T285461) (owner: Cathal Mooney) [08:31:39] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:50] (CR) Filippo Giunchedi: "LGTM overall, see inline" (1 comment) [alerts] - https://gerrit.wikimedia.org/r/702477 (owner: Dave Pifke) [08:36:41] (CR) Effie Mouzeli: [C: +2] tegola-vector-tiles: add helmfile.d config [deployment-charts] - https://gerrit.wikimedia.org/r/701138 (https://phabricator.wikimedia.org/T283159) (owner: Effie Mouzeli) [08:36:56] (PS13) Effie Mouzeli: tegola-vector-tiles: add helmfile.d config [deployment-charts] - https://gerrit.wikimedia.org/r/701138 (https://phabricator.wikimedia.org/T283159) [08:46:08] (CR) Giuseppe Lavagetto: [C: +1] "I see jbond already made most of the comments I had ready, so they've been amended. LGTM!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T264209) (owner: JMeybohm) [08:49:22] (PS1) Effie Mouzeli: hieradata: enable TLS on memcached eqiad hosts [puppet] - https://gerrit.wikimedia.org/r/702590 (https://phabricator.wikimedia.org/T271967) [08:50:57] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1261.eqiad.wmnet [08:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:07] SRE, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw1261.eqiad.wmnet` - m... 
[08:52:41] !log Deploy schema change on s1 eqiad (db1163) master T277123 [08:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:52] T277123: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 [08:53:33] (PS1) Effie Mouzeli: hieradata: replace mcrouter proxies in with eqiad hosts [puppet] - https://gerrit.wikimedia.org/r/702592 (https://phabricator.wikimedia.org/T271967) [08:53:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:54:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:55:54] SRE, Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (ema) >>! In T285875#7188988, @Legoktm wrote: > @colewhite and I dug into this, it appears to be an issue with... 
[08:57:41] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:01:11] SRE, Traffic: Preserve Server response header when generating custom error page via VCL - https://phabricator.wikimedia.org/T285926 (ema) [09:04:20] (PS1) Filippo Giunchedi: Report subprocess stdout/stderr as strings [alerts] - https://gerrit.wikimedia.org/r/702593 [09:05:08] !log Deploy schema change on s1 eqiad (db1157) master T277123 [09:05:10] (PS1) Lucas Werkmeister (WMDE): Stop using legacy entityNamespaces setting in onSetupAfterCache hook [extensions/Wikibase] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/702400 (https://phabricator.wikimedia.org/T285472) [09:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:16] T277123: Extend iwlinks.iwl_prefix to VARBINARY(32) - https://phabricator.wikimedia.org/T277123 [09:05:57] PROBLEM - mediawiki-installation DSH group on mw1265 is CRITICAL: Host mw1265 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:07:24] (PS2) Filippo Giunchedi: Report subprocess stdout/stderr as strings [alerts] - https://gerrit.wikimedia.org/r/702593 [09:12:16] (CR) David Caro: puppet.refresh_certs: don't fail if resources changed (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/701876 (owner: David Caro) [09:12:41] PROBLEM - mediawiki-installation DSH group on mw1264 is CRITICAL: Host mw1264 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:13:05] (CR) David Caro: puppet.refresh_certs: don't fail if resources changed (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/701876 (owner: David Caro) [09:13:47] (CR) David Caro: "This requires 
https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/701876" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/702082 (https://phabricator.wikimedia.org/T274498) (owner: 10David Caro) [09:19:58] (03CR) 10Jforrester: "> Patch Set 1:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702421 (https://phabricator.wikimedia.org/T260297) (owner: 10Legoktm) [09:26:33] PROBLEM - mediawiki-installation DSH group on mw1263 is CRITICAL: Host mw1263 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:28:21] (03PS1) 10Filippo Giunchedi: prometheus: don't deploy alerts to 'global' instance by default [puppet] - 10https://gerrit.wikimedia.org/r/702599 (https://phabricator.wikimedia.org/T284810) [09:28:53] (03CR) 10JMeybohm: [C: 03+1] "> Patch Set 6:" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/701938 (owner: 10Jgiannelos) [09:30:25] PROBLEM - mediawiki-installation DSH group on mw1262 is CRITICAL: Host mw1262 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:30:28] (03Abandoned) 10Cathal Mooney: Modified version of LibreNMS Prometheus.php to add prefix [software/librenms] - 10https://gerrit.wikimedia.org/r/702438 (https://phabricator.wikimedia.org/T229542) (owner: 10Cathal Mooney) [09:30:33] PROBLEM - mediawiki-installation DSH group on mw1266 is CRITICAL: Host mw1266 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:31:08] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30064/console" [puppet] - 10https://gerrit.wikimedia.org/r/702599 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [09:35:34] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[09:35:35] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:01] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:36:04] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:36] (03PS4) 10JMeybohm: dragonfly: Add dragonfly supernode and client (dfdaemon) modules [puppet] - 10https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T264209) [09:38:01] (03CR) 10JMeybohm: dragonfly: Add dragonfly supernode and client (dfdaemon) modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701530 (https://phabricator.wikimedia.org/T264209) (owner: 10JMeybohm) [09:41:34] 10SRE, 10Machine-Learning-Team, 10serviceops: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) [09:41:56] 10SRE, 10Machine-Learning-Team, 10serviceops: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) [09:42:46] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Kubernetes: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10JMeybohm) [09:44:41] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Kubernetes: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10JMeybohm) [09:47:12] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[09:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:14] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Kubernetes: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10JMeybohm) I don't like the idea of having another way of how calico-node is run (it's already complex enough). Because of that I'll sugg... [09:55:07] (03CR) 10Hnowlan: [C: 03+2] maps: Switch buster nodes to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702114 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [09:55:33] !log start of clean up of autoreview logs in ruwiki (T285608) [09:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:41] T285608: Stop logging and clean up auto review logs - https://phabricator.wikimedia.org/T285608 [09:56:02] !log installing remaining gnutls28 security updates [09:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:42] (03PS1) 10David Caro: ceph.keyring: Add requirement for the ceph-common package [puppet] - 10https://gerrit.wikimedia.org/r/702602 [09:57:08] (03PS2) 10David Caro: ceph.keyring: Add requirement for the ceph-common package [puppet] - 10https://gerrit.wikimedia.org/r/702602 [09:58:35] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Kubernetes: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) Definitely, it seems a good way to proceed. The only concern that I have is that our kube masters are lightweight VMs (1 virtual... [09:58:50] (03CR) 10David Caro: [C: 03+2] ceph.keyring: Add requirement for the ceph-common package [puppet] - 10https://gerrit.wikimedia.org/r/702602 (owner: 10David Caro) [09:59:59] marostegui: buckle up, it's 40M rows being deleted from ruwiki [10:00:04] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210701T1000). [10:05:29] !log installing remaining libgcrypt20 security updates [10:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:26] (03CR) 10Hnowlan: [C: 03+2] Switch remaining (stretch) maps hosts to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702347 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:11:44] 40 million?? [10:11:44] (03PS4) 10Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) [10:12:02] oooohhh wow thank goodness [10:12:25] (03CR) 10jerkins-bot: [V: 04-1] Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [10:13:47] (03CR) 10Ayounsi: Move RPKI alerts to Prometheus/AM (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [10:14:04] Amir1: sweeeet [10:16:22] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: fix values [deployment-charts] - 10https://gerrit.wikimedia.org/r/702604 [10:19:00] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Kubernetes: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10JMeybohm) Yeah, maybe. Calico-node runs with a memory limit of 400Mi and CPU requests if 350m but the other components will also take up... 
[10:19:35] (03CR) 10Filippo Giunchedi: [C: 03+2] Report subprocess stdout/stderr as strings [alerts] - 10https://gerrit.wikimedia.org/r/702593 (owner: 10Filippo Giunchedi) [10:20:29] (03PS5) 10Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) [10:20:36] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: fix values [deployment-charts] - 10https://gerrit.wikimedia.org/r/702604 (owner: 10Effie Mouzeli) [10:21:11] (03CR) 10jerkins-bot: [V: 04-1] Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [10:21:49] (03PS1) 10Filippo Giunchedi: Ship a minimal README.md [alerts] - 10https://gerrit.wikimedia.org/r/702606 [10:22:04] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Ship a minimal README.md [alerts] - 10https://gerrit.wikimedia.org/r/702606 (owner: 10Filippo Giunchedi) [10:22:19] Amir1: then he will realise that he will never take this space back [10:22:26] and cry himself to sleep [10:22:27] :p [10:23:16] (03Merged) 10jenkins-bot: tegola-vector-tiles: fix values [deployment-charts] - 10https://gerrit.wikimedia.org/r/702604 (owner: 10Effie Mouzeli) [10:24:34] effie: honestly, the main problem is that these changes are so massive that if I go too slow, it'll take a month to finish, if I go too fast, it'll bring down replication. 
This is okay (around 20GB), the image table in commons is around 300GB, that'll take a good month at least [10:25:18] (03PS5) 10Muehlenhoff: Don't show Kerberos ticket info in general [puppet] - 10https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) [10:25:20] I don't know why rows read in s6 is high, is the query not very optimized :/ [10:25:46] we know who we're gonna call either way [10:25:47] :p [10:26:40] (03CR) 10jerkins-bot: [V: 04-1] Don't show Kerberos ticket info in general [puppet] - 10https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [10:27:07] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [10:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [10:31:25] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:33:11] (03CR) 10Muehlenhoff: "Updated the patch, the CI failure is some unrelated breakage." [puppet] - 10https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: 10Muehlenhoff) [10:33:48] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10fgiunchedi) Another data point, as expected post-switchover the high latency uploads from jobrunners moved from codfw to eqiad since codfw is now active. 
[10:35:09] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10fgiunchedi) Also to avoid confusion I'd like to clarify that on the swift side I can't find anything obviously wrong though I don't have the bandwidth to investiga... [10:35:58] 10SRE, 10Infrastructure-Foundations, 10SRE-tools: Broken disk on thanos-be1003 but not reported / task not opened - https://phabricator.wikimedia.org/T285662 (10Volans) p:05Triage→03Medium Ack, let's keep it around for now to explore what options we have. [10:52:55] (03PS6) 10Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) [10:53:38] (03CR) 10jerkins-bot: [V: 04-1] Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [10:55:23] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: fix values for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/702609 [10:57:07] (03PS3) 10Zabe: Avoid using MWNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697851 [10:57:09] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: fix values for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/702609 (owner: 10Effie Mouzeli) [10:58:33] (03PS1) 10Gergő Tisza: Welcome tour: Mark as complete when notice is shown [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702401 (https://phabricator.wikimedia.org/T284800) [10:59:09] (03PS7) 10Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) [10:59:51] (03PS1) 10Gergő Tisza: Welcome tour: Mark as complete when notice is shown [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702402 (https://phabricator.wikimedia.org/T284800) [10:59:57] (03CR) 
10jerkins-bot: [V: 04-1] Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [11:00:05] Amir1, Lucas_WMDE, apergos, and duesen: It is that lovely time of the day again! You are hereby commanded to deploy EU Backport and Config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210701T1100). [11:00:05] Lucas_WMDE and zabe: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:06] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: fix values for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/702609 (owner: 10Effie Mouzeli) [11:00:10] o/ [11:00:14] hello [11:00:19] o/ [11:00:21] no one is signed up for the EU training slot. [11:00:29] we will have someone present for the US slot though! [11:00:31] was just about to ask, thanks [11:00:33] oh cool! [11:00:35] there is only one patch in the window [11:00:37] I have a sticker if you need one [11:00:40] and you know who :-P [11:00:50] I see two patches [11:00:55] wut [11:01:00] who snuck one in last minute? [11:01:08] I did [11:01:09] zabe did! [11:01:15] looking at it now [11:01:17] oh three now! [11:01:21] might do this before my own backport to speed it up [11:01:29] four! [11:01:38] ok, it's nice to have a tiny warning so we can spend some time to read these and make sure they are ok to go [11:01:40] (four people, three patches) [11:01:44] just saying :-P [11:01:49] (the other way around) [11:02:04] we didn't want you to feel neglected apergos [11:02:27] thanks soooo much :-P [11:02:32] oh, I already reviewed this config patch a month ago :D [11:02:33] these all look pretty straight forward [11:02:41] rebasing and deploying the MWNamespace one [11:02:41] as far as deployment goes. 
[11:02:52] anyway I can do the backports, they will take a while to get through CI though [11:02:53] are we all self serve here or who's doing what? [11:02:58] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Avoid using MWNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697851 (owner: 10Zabe) [11:03:00] (03Merged) 10jenkins-bot: tegola-vector-tiles: fix values for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/702609 (owner: 10Effie Mouzeli) [11:03:05] ok tgr you got those [11:03:20] (03PS1) 10Hnowlan: maps: make maps2010 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/702615 (https://phabricator.wikimedia.org/T269582) [11:03:36] zabe: are you self serve or woudl you like someone to deploy for you? [11:03:46] (03PS8) 10Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) [11:04:03] (03Merged) 10jenkins-bot: Avoid using MWNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697851 (owner: 10Zabe) [11:04:20] oic. I guess Lucas is doing yours then :-D [11:04:26] I can't self serve, I don't have access. But it looks like Lucas is already doing it. [11:04:32] yeah I can do it [11:04:41] okey dokey! [11:04:41] (looked at the deployers list in puppet and didn’t see a zabe there) [11:04:49] I might I might even sneak in two more [11:04:50] (03CR) 10jerkins-bot: [V: 04-1] Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [11:05:18] hm, I just pulled the change to mwdebug1001 [11:05:22] but we probably want to test on codfw? 
[11:05:29] ah sighi [11:05:32] I think I had an outdated version of my `deploy` script that opens all the terminals [11:06:23] (03PS2) 10Hnowlan: maps: reimage maps2008 as buster replica in new cluster [puppet] - 10https://gerrit.wikimedia.org/r/702099 [11:06:25] but does the extension have any setting for codfw? [11:06:29] ok, now pulled to mwdebug2001 [11:07:20] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [11:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:28] * apergos sees that it does, oh well, shoulda looked before asking [11:08:07] looks okay to merge from my side [11:08:18] apergos: not sure what you mean by having settings for codfw… [11:08:21] zabe, can you test please? [11:08:27] Lucas_WMDE: for me everything does look like before (which should be the case), so I think we are good. [11:08:34] ok [11:08:36] oh just have the mwdebug hosts in the dropdown, Lucas_WMDE, and indeed it does [11:08:47] ah, the *browser* extension ^^ [11:08:51] I thought you meant Wikibase [11:08:52] yeah, sorry :-D [11:08:56] no no no! 
[11:09:10] syncing [11:09:38] (I also looked at the effective excludeNamespaces setting in shell.php and it looked fine, still including tons of odd numbers = talk namespaces) [11:10:12] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:697851|Avoid using MWNamespace]] (duration: 01m 06s) [11:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Welcome tour: Mark as complete when notice is shown [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702401 (https://phabricator.wikimedia.org/T284800) (owner: 10Gergő Tisza) [11:10:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Welcome tour: Mark as complete when notice is shown [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702402 (https://phabricator.wikimedia.org/T284800) (owner: 10Gergő Tisza) [11:10:47] someone's on a roll :-) [11:10:50] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Stop using legacy entityNamespaces setting in onSetupAfterCache hook [extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702400 (https://phabricator.wikimedia.org/T285472) (owner: 10Lucas Werkmeister (WMDE)) [11:10:53] PROBLEM - Disk space on releases1002 is CRITICAL: DISK CRITICAL - free space: /srv/docker 5040 MB (3% inode=80%): /srv/docker/overlay2/e08bd827952d234ff75ff3917e9fb0f2e8bf6358f44d847a205186531165ca73/merged 5040 MB (3% inode=80%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops [11:10:55] just +2ed all the backports [11:11:04] I think it’s okay to deploy them in whatever order they merge in [11:11:17] Wikibase will probably slower than GrowthExperiments [11:11:38] tgr: do you want to deploy them when they’re merged or should I? 
[11:13:23] Lucas_WMDE: I can deploy them, I have two more coming (but I should probably wait until these are merged, otherwise Zuul will restart) [11:13:29] ok [11:14:05] I can just wait until you are done with the Wikibase one. There is no window after this so no rush. [11:14:38] sounds good [11:15:10] I think I’ll take my time testing the Wikibase change on mwdebug, but we can +2 your other backports as soon as the Wikibase change merges, that still gives me ca. 15 minutes ^^ [11:16:20] um a clarifying question, if we deploy them in a different order than they are merged, won't that mean some rebase shuffling during this window and then again for whoever gets the next patch after these? [11:16:34] oh it's scap. it's rsync, nm [11:16:42] Lucas_WMDE: tgr: you can just +2 all the changes and let CI process/merge them. BUT you will have to be careful when you fetch on the deployment server [11:17:12] rebase ~~ALL the changes~~ just some of the changes [11:17:35] although it might be a bit of a mental burden to have patches merged but only fetch the one you want to deploy on the server [11:17:39] apergos: it looks like Zuul is enforcing that they’re merged in the same order that they were +2ed, and then we’ll deploy in that order too [11:17:46] right [11:17:50] I thought it might do them in parallel but they’re chained [11:17:52] more or less yes [11:17:56] convenient [11:17:58] assuming the jobs pass [11:18:22] if you +2 A then B then C they will be merged in that order cause they all depend on each other in the CI queue [11:18:29] but if A fails for some reason, it is dropped from the CI queue [11:18:29] hashar: right, but when Zuul is in the middle of testing a set of patches that have been +2-ed, and you +2 another one, won't it discard the process and start a new CI job for the new set of +2-ed patches?
[11:18:43] and B and C have all the jobs cancelled and retriggered to no longer take A into account [11:18:52] so you end up with B and C merged in that order but A left out [11:18:58] *nod* [11:19:09] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [11:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:23] tgr: it puts the +2ed patch on top of the existing queue [11:19:37] B is tested in CI as if A had been merged already [11:20:01] oh that's nice! [11:20:04] TIL [11:20:08] I see. Not much point in waiting then, I guess. [11:20:17] I should really really do a training about zuul again [11:20:30] it is definitely time for a refresher [11:20:33] I used to do those years ago but definitely missed that and there is a bit of confusion [11:20:34] yeah [11:20:46] plus lots of new folks since then [11:21:01] yeah I can't say I am any good at having folks onboarded :^D [11:21:05] :-D [11:21:12] anyway, the doc is https://zuul-ci.org/docs/zuul/discussion/gating.html [11:21:31] from upstream, which is like 2 major versions above the one we use but that doc still applies [11:21:36] (i wrote part of it) [11:22:01] 👀 [11:22:08] the bulk of the idea is when one +2 a change A done to mediawiki, then +2 a change B for Wikibase [11:22:10] 10SRE: Please add btullis@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T285936 (10BTullis) [11:22:42] when starting processing B, Zuul creates a merge commit of A against master for mediawiki and creates a git ref like refs/zuul/master/B [11:22:57] CI then clone Wikibase, fetches change B [11:23:12] and clone mediawiki/core then attempts to fetch refs/zuul/master/B which has the change A ahead in the queue [11:23:16] then run the tests [11:23:31] thus the jobs running for B run for code that contains the change A [11:23:36] nice!
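The gating behaviour described above (each +2ed change is tested on top of the changes ahead of it, and a failing change is dropped while the ones behind it are retriggered without it) can be sketched as a toy model. This is not Zuul code: `gate`, `job_passes` and the change names are hypothetical, and the model ignores parallelism, timing and cross-repository refs.

```python
from typing import Callable, List, Tuple

Run = Tuple[str, Tuple[str, ...]]  # (change under test, changes it was tested on top of)


def gate(queue: List[str],
         job_passes: Callable[[str, Tuple[str, ...]], bool]) -> Tuple[List[str], List[Run]]:
    """Toy model of a dependent (gating) queue.

    Returns the changes merged, in order, plus a log of every test run.
    """
    merged: List[str] = []
    runs: List[Run] = []
    pending = list(queue)
    while pending:
        results = []
        # Speculative round: each change is tested as if everything ahead had merged.
        for i, change in enumerate(pending):
            ahead = tuple(merged + pending[:i])
            runs.append((change, ahead))
            results.append(job_passes(change, ahead))
        if all(results):
            merged.extend(pending)
            pending = []
        else:
            # Merge everything ahead of the first failure, drop the failing change,
            # and retest the changes behind it without it.
            first_fail = results.index(False)
            merged.extend(pending[:first_fail])
            pending = pending[first_fail + 1:]
    return merged, runs


# Hypothetical scenario from the chat: A fails, B and C are retriggered without A.
merged, runs = gate(["A", "B", "C"], lambda change, ahead: change != "A")
print(merged)
```

In this model B is first tested on top of A, then retested on its own once A is dropped, which matches the "B and C merged in that order but A left out" outcome described in the conversation.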
[11:23:48] and it does that for the whole chain of changes that are in the gate-and-submit queue [11:24:09] adding a Depends-On header in the commit message triggers the same logic [11:24:22] so one can test a change as if a change from another repo already got merged [11:24:33] 10SRE, 10Gerrit-Privilege-Requests: Grant Access to mediawiki gerrit group for divec - https://phabricator.wikimedia.org/T285931 (10Jdforrester-WMF) [11:24:35] but yeah point taken I should do a presentation [11:24:45] I've only added those headers in order to tell devs DON'T MERGE THIS YET [11:24:47] :-D [11:24:54] maybe to whole engineering folks [11:24:58] yeah [11:24:58] please do, I will be showing up! [11:25:11] so if your change B has a Depends-On: A and you +2 B [11:25:31] Zuul fetch the metadata for change B, notice it depends-on A, check whether A got merged or already had been +2 [11:25:35] and if not, it bails out [11:25:49] gonna have to start using that a lot more often [11:26:02] so essentially B is blocked until either A got merged or is ahead in the queue (ie A received a +2 before B got a +2) [11:26:25] it does not apply to operations/dns or operations/puppet though, they use a slightly different workflow [11:26:28] yeah, Depends-on is super cool [11:26:42] (there is no gate-and-submit , patches are just directly submitted bypassing ci) [11:27:07] yeah depends-on and the whole gating system ( maybe for a tech department update ) is the killer future of Zuul [11:27:22] gate-and-submit-wmf is nearing completion [11:27:37] I think I’ll try to `git fetch` on the deployment host after both GrowthExperiments changes merge, but before Wikibase [11:27:45] so then `scap pull` on mwdebug gets the fix on all wikis [11:27:50] for easy testin [11:27:52] *testing [11:27:57] (and then sync twice if it works) [11:28:30] I think that even if you pull both changes in mediawiki/core you can solely submodule update GrowthExperiments and deploy that [11:28:36] then submodule update Wikibase 
and deploy that [11:28:45] true, good point [11:28:48] should be fine unless you trigger a full sync in which case both repos will be updated [11:28:54] (also I forgot that I was going to let tgr deploy these, sorry ^^) [11:29:05] that is why I guess folks +2 , pull , submodule update, deploy [11:29:14] then +2 , pull etc [11:29:32] yes exactly [11:29:33] Lucas_WMDE: I can wait until you are done, I have two more backports to set up in the meanwhile [11:29:41] which is slow but gives a guarantee that the state on the deploy server is always fine / deployable [11:29:48] (please do add them to the deployment page for the record!) [11:29:51] ok then I’ll set up everything until mwdebug and let you test [11:29:55] +1 apergos [11:30:01] relying on not updating submodules is a good hack, but might be surprising if something goes south or someone else step in after [11:30:41] yeah you do not want to have to be thinking about anything extra if something is broken [11:30:46] and of course, making those tests dramatically faster would help. I did some investigation but have yet to write a problem statement [11:31:08] I wonder if some things can be split and run in parallel for a cheap speedup [11:31:24] so yeah +2 ing everything and relying on NOT updating the submodules is not documented, cause it is way too fragile [11:31:51] the tests we run are overkill, we simply run too many of them and some should only trigger for the repository they actually test [11:32:02] you can do `git submodule update extensions/Wikibase` to only update that, regardless of the patches merged. [11:32:19] but I don't think there's much drawback to updating all. [11:32:27] like we run the whole wikibase test suite for any repos participating in the wmf-quibble jobs (aka Vector, or Flow or CirrusSearch) [11:32:39] the GE patches are all frontend, they won't interfere with the test. 
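Going back to the Depends-On rule explained around 11:25 (B is blocked until A has merged or is ahead of it in the queue), a minimal sketch of that check could look like the following. The function name and data shapes are assumptions made for illustration, not Zuul's actual API.

```python
from typing import Dict, Iterable, List, Set


def can_enter_gate(change: str,
                   depends_on: Dict[str, List[str]],
                   merged: Set[str],
                   queue_ahead: Iterable[str]) -> bool:
    """Return True if every Depends-On target of `change` is merged or queued ahead of it.

    `depends_on` maps a change to the changes named in its Depends-On headers
    (hypothetical representation).
    """
    ahead = set(queue_ahead)
    return all(dep in merged or dep in ahead for dep in depends_on.get(change, []))


deps = {"B": ["A"]}
print(can_enter_gate("B", deps, merged=set(), queue_ahead=[]))     # A neither merged nor queued
print(can_enter_gate("B", deps, merged={"A"}, queue_ahead=[]))     # A already merged
print(can_enter_gate("B", deps, merged=set(), queue_ahead=["A"]))  # A received its +2 first
```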
[11:32:48] so in short, gotta split the tests so that they don't all trigger for any patches [11:33:13] !log jelto@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1262.eqiad.wmnet [11:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:34] (03PS1) 10Hnowlan: maps: reimage maps1010 as buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702619 (https://phabricator.wikimedia.org/T269582) [11:33:43] !log reboot ml-serve-ctrl100[1,2] to increase vcpus/memory (1->2 vcores, 2->4g of memory) [11:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:04] (03PS1) 10Gergő Tisza: SuggestedEdits: Return default JS data as 'noresults' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702403 (https://phabricator.wikimedia.org/T285906) [11:34:19] (03PS1) 10Gergő Tisza: SuggestedEdits: Return default JS data as 'noresults' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702404 (https://phabricator.wikimedia.org/T285906) [11:34:31] You can mark phpunit tests as @standalone and they'll run in commits on your repo but not on other repo's patches. [11:34:58] Unfortunately only Cirrus and Scribunto are using the tag so far. [11:35:10] oh I didn’t know that [11:35:12] Moving 90% of Wikibase's tests into @standalone would be so nice. [11:35:26] that sounds like something we should do at least for most of our proper unit tests [11:35:27] Lucas_WMDE: New as of ~ 15 months ago. [11:35:33] !log Deploy schema change on s8 eqiad master T276150 [11:35:41] (but probably not for our integration tests? 
sometimes we have legitimate issues due to core changes) [11:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:43] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 [11:35:46] !log reboot ml-serve-ctrl200[1,2] to increase vcpus/memory (1->2 vcores, 2->4g of memory) [11:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:59] Lucas_WMDE: That'd be great, though right now the main issue is Wikibase's endless selenium tests. Speeding up one of the jobs but not the other won't make anything merge faster. [11:36:35] (03CR) 10Gergő Tisza: [C: 03+2] SuggestedEdits: Return default JS data as 'noresults' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702403 (https://phabricator.wikimedia.org/T285906) (owner: 10Gergő Tisza) [11:36:39] Lucas_WMDE: It depends on the nature of the integration test. Unit tests are ultra-fast anyway so there's no point improving that, really. [11:36:39] hm, I don’t find any @standalone with codesearch [11:36:40] (03PS2) 10Hnowlan: maps: make maps1008 a buster replica of maps1009 [puppet] - 10https://gerrit.wikimedia.org/r/702102 (https://phabricator.wikimedia.org/T269582) [11:36:42] (03CR) 10Gergő Tisza: [C: 03+2] SuggestedEdits: Return default JS data as 'noresults' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702404 (https://phabricator.wikimedia.org/T285906) (owner: 10Gergő Tisza) [11:37:43] ah, @group Standalone? [11:37:53] Sorry, yes. [11:38:01] PHPunit group, not phpdoc tag. [11:38:14] *files away for later* [11:38:21] PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:47] We reduced Scribunto's test suite from ~ 3 mins to ~ 5 seconds IIRC. 
[11:39:13] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:39:42] (03Merged) 10jenkins-bot: Welcome tour: Mark as complete when notice is shown [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702401 (https://phabricator.wikimedia.org/T284800) (owner: 10Gergő Tisza) [11:39:44] (03Merged) 10jenkins-bot: Welcome tour: Mark as complete when notice is shown [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702402 (https://phabricator.wikimedia.org/T284800) (owner: 10Gergő Tisza) [11:40:19] okay, fetched both those backports onto deploy1002 [11:41:07] tgr: okay, both GrowthExperiments backports should be on mwdebug2001 now [11:41:11] can you test them? [11:41:42] Lucas_WMDE: there are two more merging now, I'll wait for those [11:41:51] unless you need to do a sync-world? [11:42:09] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:42:18] 10SRE, 10decommission-hardware: decommission maps2002.codfw.wmnet - https://phabricator.wikimedia.org/T285938 (10MoritzMuehlenhoff) [11:42:20] no, but I’d still prefer to deploy these in the order they were merged… [11:42:28] I didn’t realize you wanted to wait deploying them [11:43:36] I don't think the order matters whatsoever for scap [11:45:43] maybe not for scap… [11:45:59] is mwmaint2002 the active maintenance server? 
[11:46:04] https://wikitech.wikimedia.org/wiki/Mwmaint2001 seems outdated [11:46:11] I think so… I’m on it, at least [11:46:25] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1262.eqiad.wmnet [11:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:35] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw1262.eqiad.wmnet` - m... [11:46:38] Yes, 1002 in eqiad and 2002 in codfw IIRC. [11:46:52] so should I pull, test, sync the Wikibase change? while keeping in mind that the GrowthExperiments changes are still outstanding? [11:47:47] (03Merged) 10jenkins-bot: Stop using legacy entityNamespaces setting in onSetupAfterCache hook [extensions/Wikibase] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702400 (https://phabricator.wikimedia.org/T285472) (owner: 10Lucas Werkmeister (WMDE)) [11:48:24] I can test the GE ones in a second. [11:48:26] Wikibase change pulled to mwdebug2001, testing [11:49:10] But yeah, you follow the exact same process. As long as you only sync the Wikibase directory, other patches won't matter. [11:49:29] (03PS1) 10Muehlenhoff: Remove Hiera settings for maps2002 [puppet] - 10https://gerrit.wikimedia.org/r/702624 (https://phabricator.wikimedia.org/T285938) [11:49:41] The worst that can happen is unrelated errors from the other patches, but in this case there's no risk of that. [11:50:50] (note that for the us window, where there is expected to be a trainee, it would be best to do the more simple deploy as merged process) [11:51:57] okay, I think the Wikibase change is working correctly. 
syncing [11:53:44] apergos: on one hand it's easier to follow, on the other hand "...and now we wait half an hour for the patch to merge" is not super engaging training [11:54:02] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/Wikibase/: Backport: [[gerrit:702400|Stop using legacy entityNamespaces setting in onSetupAfterCache hook (T285472)]] (duration: 01m 15s) [11:54:09] no but that's when you talk about all the things to keep in mind, review git commands together and so on :) [11:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:12] T285472: Remove entityNamespaces settings - https://phabricator.wikimedia.org/T285472 [11:55:22] (03CR) 10jerkins-bot: [V: 04-1] SuggestedEdits: Return default JS data as 'noresults' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702403 (https://phabricator.wikimedia.org/T285906) (owner: 10Gergő Tisza) [11:58:34] (03Merged) 10jenkins-bot: SuggestedEdits: Return default JS data as 'noresults' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702403 (https://phabricator.wikimedia.org/T285906) (owner: 10Gergő Tisza) [11:58:54] !log jelto@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1263.eqiad.wmnet [11:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:03] tgr: do you want to take over or should I pull those changes? 
[11:59:04] file_put_contents(/cache/composer/repo/https---repo.packagist.org/provider-wikimedia$textcat.json): failed to open stream: No space left on device [11:59:18] I can take over, thanks [11:59:24] ok [11:59:26] if CI is willing [11:59:29] (03Merged) 10jenkins-bot: SuggestedEdits: Return default JS data as 'noresults' [extensions/GrowthExperiments] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702404 (https://phabricator.wikimedia.org/T285906) (owner: 10Gergő Tisza) [12:00:34] jouncebot: now [12:00:34] No deployments scheduled for the next 3 hour(s) and 59 minute(s) [12:00:39] ok cool [12:00:45] (03CR) 10Gergő Tisza: [V: 03+2 C: 03+2] "Oops, meant to do that for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/702403/" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702403 (https://phabricator.wikimedia.org/T285906) (owner: 10Gergő Tisza) [12:00:50] oh right. window supposedly over. heh [12:01:26] (03PS1) 10Arturo Borrero Gonzalez: toolforge: install jobs-framework-cli [puppet] - 10https://gerrit.wikimedia.org/r/702639 [12:01:43] (03CR) 10Gergő Tisza: "...which is this patch. I'm confused, how did this even merge?" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702403 (https://phabricator.wikimedia.org/T285906) (owner: 10Gergő Tisza) [12:01:59] tgr: the CI failure was in the test build, not the gate-and-submit build [12:02:12] only gate-and-submit matters for merging [12:02:20] don't both need to succeed though? 
[12:02:27] not as far as I know [12:02:46] gate-and-submit can even complete and merge the change before the regular test build finishes running [12:02:47] (03CR) 10jerkins-bot: [V: 04-1] toolforge: install jobs-framework-cli [puppet] - 10https://gerrit.wikimedia.org/r/702639 (owner: 10Arturo Borrero Gonzalez) [12:03:22] (probably not very common, but I’ve seen it happen) [12:09:21] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1263.eqiad.wmnet [12:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:33] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw1263.eqiad.wmnet` - m... [12:12:09] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "Overriding jenkins because the failure it reports is not related to this patch." [puppet] - 10https://gerrit.wikimedia.org/r/702639 (owner: 10Arturo Borrero Gonzalez) [12:15:37] (03PS1) 10Arturo Borrero Gonzalez: toolforge: bastion: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/702642 [12:16:17] going to wander off for awhile since the actual window is over and waiting for zuul is mind-numbing, as was pointed out earlier :-P [12:16:34] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: performance-asoranking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:55] are we waiting for zuul? 
I thought everything’s merged [12:17:07] (03CR) 10jerkins-bot: [V: 04-1] toolforge: bastion: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/702642 (owner: 10Arturo Borrero Gonzalez) [12:19:24] !log jelto@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1264-1265].eqiad.wmnet [12:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:19] Lucas_WMDE: no, it just took a while to test [12:20:30] ok, no problem [12:20:40] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/GrowthExperiments/: Backport: [[gerrit:702401|Welcome tour: Mark as complete when notice is shown (T284800)]] [[gerrit:702403|SuggestedEdits: Return default JS data as 'noresults' (T285906)]] (duration: 01m 09s) [12:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:50] T284800: Donors to newcomers: URL parameters - https://phabricator.wikimedia.org/T284800 [12:20:50] T285906: [wmf.12-regression] mobile - Suggested edits initial load is not functional - https://phabricator.wikimedia.org/T285906 [12:20:53] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "overriding jenkins, the error it reports is not related to this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/702642 (owner: 10Arturo Borrero Gonzalez) [12:22:44] (03PS9) 10Filippo Giunchedi: Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [12:22:56] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/GrowthExperiments/: Backport: [[gerrit:702402|Welcome tour: Mark as complete when notice is shown (T284800)]] [[gerrit:702404|SuggestedEdits: Return default JS data as 'noresults' (T285906)]] (duration: 01m 08s) [12:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:16] !log EU deploys done [12:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:26] (03CR) 10jerkins-bot: [V: 04-1] Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [12:23:28] nice, thanks [12:26:55] (03CR) 10Vgutierrez: [C: 03+1] Switch ncredir to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/697799 (owner: 10Muehlenhoff) [12:27:23] 10SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (10MoritzMuehlenhoff) [12:29:58] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1264-1265].eqiad.wmnet [12:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:08] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw[1264-1265].eqiad.wmn... [12:37:44] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[12:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:37] the irc feed for recent changes on en.wikiquote.org is missing a lot of edits (page creation by spambots) though it does detect me deleting the pages. Is this the correct place to report this issue? [12:39:27] !log jelto@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1266.eqiad.wmnet [12:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:49] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1266.eqiad.wmnet [12:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:59] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: `mw1266.eqiad.wmnet` - m... [12:50:32] (03PS1) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [12:54:59] (03CR) 10Jelto: [V: 03+1] "I decommissioned mw126[1-6].eqiad.wmnet in Rack A5 using the sre.hosts.decomission cookbook." 
[puppet] - 10https://gerrit.wikimedia.org/r/679527 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [12:56:03] (03PS2) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [12:57:50] (03CR) 10Hnowlan: [C: 03+1] Remove Hiera settings for maps2002 [puppet] - 10https://gerrit.wikimedia.org/r/702624 (https://phabricator.wikimedia.org/T285938) (owner: 10Muehlenhoff) [13:00:55] (03PS1) 10Elukey: Add dummy tokens to ML server master nodes [labs/private] - 10https://gerrit.wikimedia.org/r/702646 [13:01:13] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add dummy tokens to ML server master nodes [labs/private] - 10https://gerrit.wikimedia.org/r/702646 (owner: 10Elukey) [13:02:18] (03PS3) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [13:02:59] !log Deploy schema change on s2 eqiad master T276150 [13:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:10] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150 [13:05:11] (03PS1) 10Ottomata: Require python3-pandas for performance asoranking [puppet] - 10https://gerrit.wikimedia.org/r/702647 (https://phabricator.wikimedia.org/T275786) [13:07:06] (03PS4) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [13:07:40] (03CR) 10jerkins-bot: [V: 04-1] ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [13:08:48] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:09:03] (03PS1) 10Ema: varnish: Server response 
header in custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/702648 (https://phabricator.wikimedia.org/T285926) [13:09:43] (03CR) 10Elukey: [C: 03+1] Require python3-pandas for performance asoranking [puppet] - 10https://gerrit.wikimedia.org/r/702647 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [13:10:09] (03CR) 10Ottomata: [C: 03+2] Require python3-pandas for performance asoranking [puppet] - 10https://gerrit.wikimedia.org/r/702647 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [13:11:46] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:15:48] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:13] (03PS5) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [13:17:33] (03CR) 10jerkins-bot: [V: 04-1] ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [13:18:01] (03CR) 10Muehlenhoff: [C: 03+2] Remove Hiera settings for maps2002 [puppet] - 10https://gerrit.wikimedia.org/r/702624 (https://phabricator.wikimedia.org/T285938) (owner: 10Muehlenhoff) [13:18:10] (03PS1) 10David Caro: backy2: add missing ceph::common dependency to tests [puppet] - 10https://gerrit.wikimedia.org/r/702652 [13:18:12] (03PS1) 10David Caro: wmcs.ceph: remove unused backup role [puppet] - 10https://gerrit.wikimedia.org/r/702653 [13:19:06] dcaro: o/ from jenkins I see errors like "profile::wmcs::backy2 on debian-10-x86_64 is expected to compile into a catalogue without dependency cycles", something WIP? 
[13:20:52] (03PS6) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [13:21:54] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30072/console" [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [13:22:13] (03CR) 10jerkins-bot: [V: 04-1] ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [13:22:45] elukey: yep, got this to fix it https://gerrit.wikimedia.org/r/c/operations/puppet/+/702652 [13:23:00] not sure why it did not break when I first merged the previous patch [13:23:37] elukey: do you know if the jenkins job tries to be smart when running the puppet tests and skips some? [13:25:06] (03PS1) 10Muehlenhoff: Remove maps2002 from conftool [puppet] - 10https://gerrit.wikimedia.org/r/702654 (https://phabricator.wikimedia.org/T285938) [13:25:43] dcaro: thanks! 
No idea :( [13:28:53] (03CR) 10Elukey: [V: 03+1] "The jenkins failures should be separate, currently being fixed by wmcs :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [13:31:45] 10SRE, 10Traffic, 10Patch-For-Review: Preserve Server response header when generating custom error page via VCL - https://phabricator.wikimedia.org/T285926 (10ema) p:05Triage→03Medium [13:33:28] (03CR) 10Muehlenhoff: [C: 03+2] Remove maps2002 from conftool [puppet] - 10https://gerrit.wikimedia.org/r/702654 (https://phabricator.wikimedia.org/T285938) (owner: 10Muehlenhoff) [13:33:55] (03PS1) 10David Caro: wmcs.ceph: Add the new 17, 19 and 20 OSDs [puppet] - 10https://gerrit.wikimedia.org/r/702655 (https://phabricator.wikimedia.org/T285858) [13:34:39] (03PS3) 10MSantos: maps: fix osm sync directory path [puppet] - 10https://gerrit.wikimedia.org/r/701558 [13:35:33] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts maps2002.codfw.wmnet [13:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:54] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.ceph: Add the new 17, 19 and 20 OSDs [puppet] - 10https://gerrit.wikimedia.org/r/702655 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [13:36:00] (03CR) 10jerkins-bot: [V: 04-1] maps: fix osm sync directory path [puppet] - 10https://gerrit.wikimedia.org/r/701558 (owner: 10MSantos) [13:37:19] (03PS4) 10MSantos: maps: fix osm sync directory path [puppet] - 10https://gerrit.wikimedia.org/r/701558 [13:37:36] (03PS1) 10Giuseppe Lavagetto: mwdebug: bump mediawiki version [deployment-charts] - 10https://gerrit.wikimedia.org/r/702657 [13:38:43] (03CR) 10jerkins-bot: [V: 04-1] maps: fix osm sync directory path [puppet] - 10https://gerrit.wikimedia.org/r/701558 (owner: 10MSantos) [13:42:24] (03CR) 10Dzahn: [C: 03+2] "thanks! looks good. merging. 
will follow-up with a change for the yaml in hieradata/hosts" [puppet] - 10https://gerrit.wikimedia.org/r/679527 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [13:43:15] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10decommission-hardware, 10Patch-For-Review: decommission maps2002.codfw.wmnet - https://phabricator.wikimedia.org/T285938 (10MSantos) [13:44:03] (03PS1) 10Muehlenhoff: Remove DHCP record for maps2002 [puppet] - 10https://gerrit.wikimedia.org/r/702658 [13:45:09] (03PS1) 10Dzahn: remove hieradata/hosts files for former eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/702659 (https://phabricator.wikimedia.org/T280203) [13:48:38] (03CR) 10Jelto: [C: 03+1] "lgtm. We just have remember to recreate this files for the new canaries (and maybe merge them)" [puppet] - 10https://gerrit.wikimedia.org/r/702659 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [13:49:01] (03PS1) 10Jgiannelos: Make production images lighter [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/702661 [13:49:08] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: disable probes and enable debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/702662 [13:49:15] (03CR) 10Dzahn: [C: 03+2] remove hieradata/hosts files for former eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/702659 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [13:49:31] (03PS2) 10Jgiannelos: Reduce production image size [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/702661 [13:50:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts maps2002.codfw.wmnet [13:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:18] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10decommission-hardware, 10Patch-For-Review: decommission maps2002.codfw.wmnet - https://phabricator.wikimedia.org/T285938 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission 
executed by jmm@cumin2002 for hosts: `maps2002.codfw.wmnet` - maps2002.c... [13:50:23] (03PS2) 10Muehlenhoff: Remove DHCP record for maps2002 [puppet] - 10https://gerrit.wikimedia.org/r/702658 [13:50:48] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [13:51:08] (03CR) 10Muehlenhoff: [C: 03+2] Remove DHCP record for maps2002 [puppet] - 10https://gerrit.wikimedia.org/r/702658 (owner: 10Muehlenhoff) [13:52:54] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: disable probes and enable debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/702662 (owner: 10Effie Mouzeli) [13:53:07] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: disable probes and enable debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/702662 (owner: 10Effie Mouzeli) [13:53:56] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10decommission-hardware, 10Patch-For-Review: decommission maps2002.codfw.wmnet - https://phabricator.wikimedia.org/T285938 (10MoritzMuehlenhoff) [13:54:10] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10decommission-hardware, 10Patch-For-Review: decommission maps2002.codfw.wmnet - https://phabricator.wikimedia.org/T285938 (10MoritzMuehlenhoff) a:03Papaul [13:54:43] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) @Jclark-ctr @wiki_willy The 6 servers at the bottom of rack A5 (mw1261 through mw1266) have been decomed and... 
[13:55:41] (03Merged) 10jenkins-bot: tegola-vector-tiles: disable probes and enable debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/702662 (owner: 10Effie Mouzeli) [13:55:56] (03PS2) 10Ema: varnish: Server response header in custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/702648 (https://phabricator.wikimedia.org/T285926) [13:57:34] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Jelto) [13:59:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) In case it helps here, today we shut down 6 servers in A5 (T280203#7190053), you can replace those with new servers. [13:59:30] (03PS3) 10Ema: varnish: Server response header in custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/702648 (https://phabricator.wikimedia.org/T285926) [14:00:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: bump mediawiki version [deployment-charts] - 10https://gerrit.wikimedia.org/r/702657 (owner: 10Giuseppe Lavagetto) [14:00:44] 10SRE, 10serviceops: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Jelto) [14:00:46] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:59] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: Add the new 17, 19 and 20 OSDs [puppet] - 10https://gerrit.wikimedia.org/r/702655 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [14:01:12] (03PS2) 10David Caro: wmcs.ceph: Add the new 17, 19 and 20 OSDs [puppet] - 10https://gerrit.wikimedia.org/r/702655 (https://phabricator.wikimedia.org/T285858) [14:01:21] (03PS10) 10Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - 
10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) [14:01:47] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/702663 [14:02:54] (03Merged) 10jenkins-bot: mwdebug: bump mediawiki version [deployment-charts] - 10https://gerrit.wikimedia.org/r/702657 (owner: 10Giuseppe Lavagetto) [14:07:28] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/702663 (owner: 10Effie Mouzeli) [14:09:36] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:10:13] (03Merged) 10jenkins-bot: tegola-vector-tiles: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/702663 (owner: 10Effie Mouzeli) [14:12:34] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:12:38] 10SRE, 10Dumps-Generation: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (10ArielGlenn) 05Stalled→03Resolved Hey this is now verified and we're closing. Thanks for your patience, everybody! 
[14:17:34] PROBLEM - Varnish frontend child restarted on cp3059 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3059&var-datasource=esams+prometheus/ops [14:17:42] (03CR) 10Muehlenhoff: [C: 03+2] Switch ncredir to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/697799 (owner: 10Muehlenhoff) [14:23:34] 10SRE, 10Traffic: cp3059 Varnish child crash: Worker Pool Queue does not move - https://phabricator.wikimedia.org/T285953 (10ema) [14:25:28] (03CR) 10Herron: [C: 03+1] Add btullis to the ops security group [puppet] - 10https://gerrit.wikimedia.org/r/702424 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [14:27:20] (03PS1) 10Muehlenhoff: Default nginx::profile to light flavour [puppet] - 10https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) [14:31:57] (03CR) 10jerkins-bot: [V: 04-1] Default nginx::profile to light flavour [puppet] - 10https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:33:23] (03PS2) 10Muehlenhoff: Default nginx::profile to light flavour [puppet] - 10https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) [14:34:18] can someone from service ops please respond to https://phabricator.wikimedia.org/T285603 [14:35:33] (03PS1) 10Ema: varnish: do not set reason for 428, 429, 431 and 511 [puppet] - 10https://gerrit.wikimedia.org/r/702671 (https://phabricator.wikimedia.org/T285926) [14:39:53] (03PS2) 10Ema: varnish: do not set reason for 428, 429, 431 and 511 [puppet] - 10https://gerrit.wikimedia.org/r/702671 (https://phabricator.wikimedia.org/T285926) [14:41:43] 10SRE, 10Platform Engineering, 10SRE-Access-Requests, 10Patch-For-Review: Root access to AQS cluster - https://phabricator.wikimedia.org/T285899 (10herron) Looks reasonable to me, and thanks much for writing the patch! 
Typically group changes involving full root access are reviewed/approved during the SRE... [14:43:00] 10SRE, 10Platform Engineering, 10SRE-Access-Requests, 10Patch-For-Review: Root access to AQS cluster - https://phabricator.wikimedia.org/T285899 (10herron) p:05Triage→03Medium [14:43:15] 10SRE, 10Traffic: cp3059 Varnish child crash: Worker Pool Queue does not move - https://phabricator.wikimedia.org/T285953 (10ema) Relevant upstream issues: - https://github.com/varnishcache/varnish-cache/issues/2814 - https://github.com/varnishcache/varnish-cache/issues/2862 Related patch to look into: https... [14:43:28] (03PS7) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [14:44:50] 10SRE, 10Traffic: cp3059 Varnish child crash: Worker Pool Queue does not move - https://phabricator.wikimedia.org/T285953 (10ema) p:05Triage→03Medium [14:44:51] (03CR) 10jerkins-bot: [V: 04-1] ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [14:45:10] !log installing glib2.0 security updates on buster [14:45:14] (03PS2) 10Herron: add fgoodwin (uid=frankie) to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/702439 (https://phabricator.wikimedia.org/T285580) [14:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:36] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:57] (03CR) 10Vgutierrez: [C: 03+1] varnish: do not set reason for 428, 429, 431 and 511 [puppet] - 10https://gerrit.wikimedia.org/r/702671 (https://phabricator.wikimedia.org/T285926) (owner: 10Ema) [14:51:14] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:24] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [14:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:46] (03CR) 10Herron: [C: 03+2] add fgoodwin (uid=frankie) to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/702439 (https://phabricator.wikimedia.org/T285580) (owner: 10Herron) [14:53:06] !log depool mw2380 for disk repair - T285603 [14:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:14] T285603: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 [14:55:09] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10jijiki) @Papaul sorry for the delay, the server can be turned off any time [14:56:01] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10Papaul) Thank you. [14:57:37] !log jiji@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2380.codfw.wmnet [14:57:38] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for fgoodwin - https://phabricator.wikimedia.org/T285580 (10herron) 05Open→03Resolved Hi @FGoodwin, your ldap account has been added to group `wmf`. I'll transition this to resolved now, but please don't hesitate to reopen if an... 
[14:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:54] 10SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (10MoritzMuehlenhoff) [14:58:39] !log poweroff mw2380 for disk replacement [14:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:02] (03PS2) 10Elukey: backy2: add missing ceph::common dependency to tests [puppet] - 10https://gerrit.wikimedia.org/r/702652 (owner: 10David Caro) [15:01:06] PROBLEM - Host mw2380 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:54] (03PS1) 10David Caro: ceph.keyring: ensure that the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702677 [15:02:14] (03CR) 10Elukey: "Just rebasing to see if jenkins is happy with this change. In case, can we merge to smooth out a bit the current puppet validation checks?" [puppet] - 10https://gerrit.wikimedia.org/r/702652 (owner: 10David Caro) [15:02:30] (03PS2) 10David Caro: ceph.keyring: ensure that the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) [15:02:34] (03CR) 10jerkins-bot: [V: 04-1] ceph.keyring: ensure that the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [15:03:07] (03CR) 10jerkins-bot: [V: 04-1] ceph.keyring: ensure that the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [15:03:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks fine, but the creation of a new access group will need discussion/signoff in next SRE meeting (12th of July)." 
[puppet] - 10https://gerrit.wikimedia.org/r/702452 (https://phabricator.wikimedia.org/T285899) (owner: 10Eevans) [15:04:04] (03CR) 10David Caro: [C: 03+2] backy2: add missing ceph::common dependency to tests [puppet] - 10https://gerrit.wikimedia.org/r/702652 (owner: 10David Caro) [15:04:36] (03CR) 10Filippo Giunchedi: [C: 03+1] "Some nits inline but overall LGTM, feel free to merge (alerts will auto-deploy)" (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [15:05:46] (03CR) 10David Caro: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/702652 (owner: 10David Caro) [15:06:04] (03PS3) 10David Caro: ceph.keyring: ensure that the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) [15:06:43] (03CR) 10jerkins-bot: [V: 04-1] ceph.keyring: ensure that the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [15:07:01] (03PS4) 10David Caro: ceph.keyring: ensure that the bootstrap dir exists [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) [15:07:02] RECOVERY - Host mw2380 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [15:08:08] elukey: merged the test fix, let me know if it helps [15:08:37] <3 [15:09:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [15:09:32] 10SRE, 10ops-codfw: Degraded RAID on mw2380 - https://phabricator.wikimedia.org/T285603 (10Papaul) 05Open→03Resolved Disk replaced. Please go ahead and re-image the server. 
thanks [15:09:34] PROBLEM - PHP7 jobrunner on mw2380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [15:10:17] (03PS8) 10Elukey: ml_k8s::master: add profile::kubernetes::node [puppet] - 10https://gerrit.wikimedia.org/r/702645 (https://phabricator.wikimedia.org/T285927) [15:10:46] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:44] PROBLEM - PHP7 rendering on mw2380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [15:12:38] PROBLEM - SSH on mw2380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:15:08] PROBLEM - Host mw2380 is DOWN: PING CRITICAL - Packet loss = 100% [15:15:53] (03CR) 10Bstorm: "Are there cases where the path does not have a directory to create? I only ask because this seems more clever than explicit. It might be a" [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [15:19:46] RECOVERY - Device not healthy -SMART- on mw2380 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw2380&var-datasource=codfw+prometheus/ops [15:25:08] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10decommission-hardware, 10Patch-For-Review: decommission maps2002.codfw.wmnet - https://phabricator.wikimedia.org/T285938 (10Papaul) [15:25:50] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10decommission-hardware, 10Patch-For-Review: decommission maps2002.codfw.wmnet - https://phabricator.wikimedia.org/T285938 (10Papaul) 05Open→03Resolved complete [15:32:49] (03CR) 10Ema: [C: 03+2] varnish: do not set reason for 428, 429, 431 and 511 [puppet] - 10https://gerrit.wikimedia.org/r/702671 (https://phabricator.wikimedia.org/T285926) (owner: 10Ema) [15:33:07] (03PS2) 10Jdlrobson: Use Vue.js for QuickSurveys on available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702434 (https://phabricator.wikimedia.org/T285890) [15:37:50] (03CR) 10Ayounsi: "Thanks!" 
(033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [15:38:04] (03PS11) 10Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) [15:38:47] (03CR) 10jerkins-bot: [V: 04-1] Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [15:47:59] (03CR) 10MSantos: [C: 03+2] Unify production server and pregeneration images [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/701529 (owner: 10Jgiannelos) [15:48:14] (03CR) 10Filippo Giunchedi: Move RPKI alerts to Prometheus/AM (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [15:49:11] (03PS12) 10Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) [15:49:16] (03Merged) 10jenkins-bot: Unify production server and pregeneration images [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/701529 (owner: 10Jgiannelos) [15:49:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:49:43] (03CR) 10Ayounsi: Move RPKI alerts to Prometheus/AM (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [15:50:45] (03CR) 10jerkins-bot: [V: 04-1] Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [15:51:07] RECOVERY - Check systemd state on ml-serve-ctrl1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:51:35] (03PS13) 10Ayounsi: Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) [15:52:31] (03PS3) 10Hnowlan: maps: reimage maps2008 as buster replica in new cluster [puppet] - 10https://gerrit.wikimedia.org/r/702099 [15:52:33] (03PS2) 10Hnowlan: maps: make maps2010 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/702615 (https://phabricator.wikimedia.org/T269582) [15:52:35] (03PS1) 10Hnowlan: maps: standardise the maps2.0 config in codfw, remove old nodes [puppet] - 10https://gerrit.wikimedia.org/r/702687 (https://phabricator.wikimedia.org/T269582) [15:53:29] (03CR) 10Ayounsi: [C: 03+2] Move RPKI alerts to Prometheus/AM [alerts] - 10https://gerrit.wikimedia.org/r/700649 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [15:54:59] (03PS1) 10Reedy: Revert "Replace depricating method IContextSource::getWikiPage && IContextSource::canUseWikiPage" [extensions/ConfirmEdit] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702707 [15:55:50] (03PS4) 10Hnowlan: maps: reimage maps2008 as buster replica in new cluster [puppet] - 10https://gerrit.wikimedia.org/r/702099 [15:56:27] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10Cmjohnson) Ticket opened with Dell [15:56:53] (03PS2) 10Reedy: Revert "Replace depricating method IContextSource::getWikiPage && IContextSource::canUseWikiPage" [extensions/ConfirmEdit] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702707 (https://phabricator.wikimedia.org/T285959) [15:57:25] (03PS5) 10Hnowlan: maps: reimage maps2008 as buster replica in new cluster 
[puppet] - 10https://gerrit.wikimedia.org/r/702099 [15:57:27] (03PS3) 10Hnowlan: maps: make maps2010 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/702615 (https://phabricator.wikimedia.org/T269582) [15:57:35] (03CR) 10Reedy: [C: 03+2] Revert "Replace depricating method IContextSource::getWikiPage && IContextSource::canUseWikiPage" [extensions/ConfirmEdit] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702707 (https://phabricator.wikimedia.org/T285959) (owner: 10Reedy) [15:58:23] (03PS1) 10Ayounsi: Remove old RPKI Grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/702688 (https://phabricator.wikimedia.org/T282806) [15:58:57] (03PS1) 10Razzi: Make analytics-hive temporarily point to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/702689 [16:00:04] jbond42 and cdanis: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210701T1600). 
[16:01:24] (03CR) 10Elukey: [C: 03+1] Make analytics-hive temporarily point to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/702689 (owner: 10Razzi) [16:02:04] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove old RPKI Grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/702688 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [16:02:19] (03CR) 10Ayounsi: [C: 03+2] Remove old RPKI Grafana alerts [puppet] - 10https://gerrit.wikimedia.org/r/702688 (https://phabricator.wikimedia.org/T282806) (owner: 10Ayounsi) [16:05:50] (03CR) 10David Caro: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [16:06:21] (03PS2) 10Razzi: Make analytics-hive temporarily point to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/702689 [16:06:25] (03CR) 10Volans: puppet.refresh_certs: don't fail if resources changed (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [16:06:31] (03PS6) 10Hnowlan: maps: reimage maps2008 as buster replica in new cluster [puppet] - 10https://gerrit.wikimedia.org/r/702099 [16:07:52] 10SRE, 10ops-eqiad, 10User-fgiunchedi: Disk failed on thanos-be1003 - https://phabricator.wikimedia.org/T285664 (10Cmjohnson) A ticket has been created with Dell You have successfully submitted request SR1063937753. 
[16:09:46] (03CR) 10David Caro: puppet.refresh_certs: don't fail if resources changed (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [16:11:16] (03CR) 10Volans: puppet.refresh_certs: don't fail if resources changed (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701876 (owner: 10David Caro) [16:11:54] !log restart varnish-fe on cp3059 - T285953 [16:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:04] T285953: cp3059 Varnish child crash: Worker Pool Queue does not move - https://phabricator.wikimedia.org/T285953 [16:13:36] (03CR) 10Razzi: [C: 03+2] Make analytics-hive temporarily point to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/702689 (owner: 10Razzi) [16:14:25] RECOVERY - Varnish frontend child restarted on cp3059 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3059&var-datasource=esams+prometheus/ops [16:15:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:18:57] (03CR) 10Btullis: [C: 03+2] Add btullis to the ops security group [puppet] - 10https://gerrit.wikimedia.org/r/702424 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [16:19:18] (03PS5) 10Btullis: Add btullis to the ops security group [puppet] - 10https://gerrit.wikimedia.org/r/702424 (https://phabricator.wikimedia.org/T285754) [16:19:24] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add btullis to the ops security group [puppet] - 10https://gerrit.wikimedia.org/r/702424 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [16:20:02] !jouncebot now [16:20:02] a Python reminder bot for deployments. 
see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [16:20:08] jouncebot now [16:20:08] For the next 0 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210701T1600) [16:20:24] (03Merged) 10jenkins-bot: Revert "Replace depricating method IContextSource::getWikiPage && IContextSource::canUseWikiPage" [extensions/ConfirmEdit] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702707 (https://phabricator.wikimedia.org/T285959) (owner: 10Reedy) [16:23:01] !log reedy@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/ConfirmEdit/SimpleCaptcha/SimpleCaptcha.php: T285959 (duration: 01m 20s) [16:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:08] T285959: Captcha interface is not shown to unregistered users, page save is not possible - https://phabricator.wikimedia.org/T285959 [16:23:42] Reedy: all clear? thinking of rolling back to group0 for T285951, per Krinkle. [16:23:43] T285951: Some section links in search results are redlinks - https://phabricator.wikimedia.org/T285951 [16:23:56] Yup [16:25:06] cool, thx. [16:25:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:27:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:27:57] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:56] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.37.0-wmf.12" [16:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:13] (03PS1) 10Brennen Bearnes: Revert "group1 wikis to 1.37.0-wmf.12" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702694 [16:30:15] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "group1 wikis to 1.37.0-wmf.12" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702694 (owner: 10Brennen Bearnes) [16:30:55] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.37.0-wmf.12" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702694 (owner: 10Brennen Bearnes) [16:33:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:47] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:00:05] chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210701T1700). 
[17:00:27] (03PS1) 10Bartosz Dziewoński: EventDispatcher: Ensure we fetch page content from the primary database [extensions/DiscussionTools] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702708 (https://phabricator.wikimedia.org/T285895) [17:00:36] (03PS1) 10Bartosz Dziewoński: EventDispatcher: Ensure we fetch page content from the primary database [extensions/DiscussionTools] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702709 (https://phabricator.wikimedia.org/T285895) [17:13:45] 10SRE, 10serviceops, 10Patch-For-Review: Delay spinner showing for graphs for 1s - https://phabricator.wikimedia.org/T256641 (10herron) p:05Triage→03Medium [17:15:35] 10SRE, 10observability: mtail testing infrastructure does not report Runtime errors - https://phabricator.wikimedia.org/T285533 (10herron) p:05Triage→03Medium [17:16:02] 10SRE, 10observability, 10good first task: mtail testing infrastructure prints python deprecation warnings - https://phabricator.wikimedia.org/T285534 (10herron) p:05Triage→03Medium [17:17:36] 10SRE, 10observability, 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10herron) p:05Triage→03High [17:18:36] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Kubernetes, 10Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10herron) p:05Triage→03Medium [17:18:44] (03CR) 10Ayounsi: "This change is ready for review." [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (owner: 10Ayounsi) [17:19:41] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (10aaron... 
[17:20:32] (03PS4) 10Ayounsi: Port labs-in4/6 to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) [17:22:06] (03CR) 10Majavah: Port labs-in4/6 to Capirca (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [17:23:40] (03PS1) 10Bstorm: cloud nfs: set up cloudstore1009 for DRBD [puppet] - 10https://gerrit.wikimedia.org/r/702701 (https://phabricator.wikimedia.org/T224747) [17:26:26] 10SRE: Please add btullis@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T285936 (10herron) Hi @BTullis, sure, I've just added you to analytics-alerts and you should be receiving these emails now. For analytics-announce, a subscription request via https://lists.wikimedia.... [17:27:11] 10SRE: Please add btullis@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T285936 (10herron) p:05Triage→03Medium [17:28:00] 10SRE, 10Gerrit-Privilege-Requests: Grant Access to mediawiki gerrit group for divec - https://phabricator.wikimedia.org/T285931 (10herron) p:05Triage→03Medium [17:29:09] 10SRE, 10SRE-OnFire, 10observability: Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10herron) p:05Triage→03Medium [17:29:41] 10SRE, 10SRE-OnFire, 10observability, 10Patch-For-Review: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569 (10herron) p:05Triage→03Medium [17:32:29] 10SRE, 10Gerrit-Privilege-Requests: Grant Access to mediawiki gerrit group for divec - https://phabricator.wikimedia.org/T285931 (10Legoktm) @Jdforrester-WMF access to the "mediawiki" group is handled by #mediawiki-gerrit-group-requests per 10SRE, 10Wikimedia-Mailing-lists: Redirect https://lists.wikimedia.org/pipermail/foobar/ to 
https://lists.wikimedia.org/hyperkitty/list/foobar@lists.wikimedia.org/ - https://phabricator.wikimedia.org/T285949 (10herron) p:05Triage→03Medium [17:34:57] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:41] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:53] 10SRE, 10MediaWiki-Gerrit-Group-Requests: Grant Access to mediawiki gerrit group for divec - https://phabricator.wikimedia.org/T285931 (10Jdforrester-WMF) [17:42:47] (03CR) 10Bstorm: [C: 03+2] cloud nfs: set up cloudstore1009 for DRBD [puppet] - 10https://gerrit.wikimedia.org/r/702701 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [17:48:31] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (10BTullis) Thanks. I can confirm that I've now been able to access puppetmasters and other servers requiring `ops` group membership. One thing that doesn't... [17:52:18] (03PS1) 10Ahmon Dancy: collect both version and tag from wikiversions output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 [17:55:04] (03PS5) 10Ayounsi: Port labs-in4/6 to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) [17:55:23] 10SRE, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (10BTullis) Also LibreNMS and Logstash authentication don't seem to let me in. Neither is urgent, just thought I'd let you know in case there is anything els... 
[17:56:15] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I ♥ Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210701T1800). [18:00:05] Jdlrobson and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] hiii [18:00:15] I can deploy today [18:00:26] Jdlrobson: around? [18:01:05] (03CR) 10Urbanecm: [C: 03+2] EventDispatcher: Ensure we fetch page content from the primary database [extensions/DiscussionTools] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702708 (https://phabricator.wikimedia.org/T285895) (owner: 10Bartosz Dziewoński) [18:01:14] (03CR) 10Urbanecm: [C: 03+2] EventDispatcher: Ensure we fetch page content from the primary database [extensions/DiscussionTools] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702709 (https://phabricator.wikimedia.org/T285895) (owner: 10Bartosz Dziewoński) [18:02:28] MatmaRex: i'll ping you once it's ready to be tested [18:02:46] thanks [18:03:24] urbanecm: i can test the happy path, but the real verification will be in whether the exceptions stop (and no new ones appear in their place) [18:03:56] ack. 
The point of the test in this case is whether it's not _worse_, i guess [18:04:44] yeah, just in case we all somehow missed a typo or something [18:04:56] i found the "mediawiki-new-errors" logstash dashboard and i'll watch that afterwards [18:05:55] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, as discussed the install and cloudcontrol terms could be left out (they'd hit the default allow as the IPs don't match private4 or p" [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [18:07:32] (03Merged) 10jenkins-bot: EventDispatcher: Ensure we fetch page content from the primary database [extensions/DiscussionTools] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702708 (https://phabricator.wikimedia.org/T285895) (owner: 10Bartosz Dziewoński) [18:07:35] (03Merged) 10jenkins-bot: EventDispatcher: Ensure we fetch page content from the primary database [extensions/DiscussionTools] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702709 (https://phabricator.wikimedia.org/T285895) (owner: 10Bartosz Dziewoński) [18:09:15] MatmaRex: excellent [18:09:40] Jdlrobson: ping? [18:10:42] MatmaRex: pulled to mwdebug2001, please have a look. [18:10:53] yup [18:13:18] urbanecm: seems good, i got a notification for this comment: https://test2.wikipedia.org/wiki/Talk:Main_Page#c-Matma_Rex_test_2021-07-01-2021-07-01T18%3A12%3A00.000Z-Matma_Rex-2021-07-01T18%3A11%3A00.000Z [18:14:02] MatmaRex: i see `Expectation (writes <=) 0 by MediaWiki::restInPeace not met (actual: 2): query-m: DELETE FROM `echo_unread_wikis` WHERE euw_user = N AND euw_wiki = 'X'` but that's probably caused by Echo, right? 
[18:14:44] hmm, yeah, looks unrelated [18:15:18] Let me just check if it appears in logs before, and if it does, i'll sync [18:15:38] urbanecm: looks like this bug: https://phabricator.wikimedia.org/T219592 [18:16:44] yeah, happens quite a lot [18:16:46] syncing :) [18:17:37] (03CR) 10Zfilipin: [C: 03+1] "@Mukunda Modell: feel free to merge!" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: 10Sahilgrewalhere) [18:18:44] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.11/extensions/DiscussionTools/includes/Notifications/EventDispatcher.php: 6d9043087ec421e1321cd6797934928e2651b1c1: EventDispatcher: Ensure we fetch page content from the primary database (T285895) (duration: 01m 14s) [18:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:54] T285895: ApiUsageException: There is no revision with ID [REDACTED]. - https://phabricator.wikimedia.org/T285895 [18:20:17] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/DiscussionTools/includes/Notifications/EventDispatcher.php: 654877f92fa18ae766d693630025c69400cad3ac: EventDispatcher: Ensure we fetch page content from the primary database (T285895) (duration: 01m 12s) [18:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:27] MatmaRex: here you go. Anything else I can help with? [18:20:41] thanks. hopefully nothing else :D [18:21:04] great :) [18:21:26] and by the way, thanks for the toolset your team creates. I like them a lot. 
[18:22:05] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: hive-server2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:07] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [18:23:41] :D [18:23:59] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:01] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [18:36:45] urbanecm: hey present now [18:37:05] hey Jdlrobson [18:37:22] can you test it in prod somehow? [18:37:40] yep [18:37:42] if you enable it [18:37:44] great [18:37:46] (03CR) 10Urbanecm: [C: 03+2] Use Vue.js for QuickSurveys on available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702434 (https://phabricator.wikimedia.org/T285890) (owner: 10Jdlrobson) [18:37:52] enable what? the config patch? [18:38:18] yep that one [18:38:27] good [18:38:31] (03Merged) 10jenkins-bot: Use Vue.js for QuickSurveys on available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702434 (https://phabricator.wikimedia.org/T285890) (owner: 10Jdlrobson) [18:39:03] Jdlrobson: available at mwdebug2001, please have a look [18:39:13] looking [18:41:02] urbanecm: feel free to sync! [18:41:21] syncing [18:42:14] 10SRE, 10SRE-Access-Requests, 10SecTeam-Processed, 10Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T285877 (10sbassett) @herron - thanks, confirmed it's working. I'll make this task public now. 
[18:42:26] 10SRE, 10SRE-Access-Requests, 10SecTeam-Processed, 10Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T285877 (10sbassett) [18:43:03] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7995f7abe3b94eb0326064cbbd1d3111f8f21365: Use Vue.js for QuickSurveys on available wikis (T285890) (duration: 01m 09s) [18:43:09] Jdlrobson: should be live! [18:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:13] T285890: Remove OOUI surveys and default to Vue.js - https://phabricator.wikimedia.org/T285890 [18:43:13] anything else i can help with? [18:43:25] thanks urbanecm nothing else I Need [18:43:32] good! [18:44:02] sorry again for the lateness. Obviously need to check my IRC notification settings as something is going wrong there.. [18:44:57] or maybe your browser just blocks irccloud notifications? [18:45:05] (according to your whois, you're on irccloud) [18:49:20] 10SRE: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10RobH) [18:49:32] 10SRE: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10RobH) [18:49:34] 10SRE, 10DC-Ops: documented procedure for replacing disks in software RAID servers - https://phabricator.wikimedia.org/T220842 (10RobH) 05Resolved→03Open [18:49:36] 10SRE, 10DC-Ops: documented procedure for replacing disks in software RAID servers - https://phabricator.wikimedia.org/T220842 (10RobH) 05Open→03Resolved This is now documented on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions The responsibility of it belonging to the ser... 
[18:50:02] !log otto@deploy1002 Started deploy [analytics/refinery@7dea883] (hadoop-test): Deploying to analytics-test cluster for testing gobblin [analytics/refinery@7dea883] [18:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:35] (03PS1) 10Ahmon Dancy: Trigger update-train-versions job at end of wmf-publish pipeline [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702710 [18:50:47] (03CR) 10Ahmon Dancy: [C: 03+2] Trigger update-train-versions job at end of wmf-publish pipeline [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702710 (owner: 10Ahmon Dancy) [18:50:59] (03PS1) 10Razzi: Point analytics-hive to an-coord1001.eqiad.wmnet once again [dns] - 10https://gerrit.wikimedia.org/r/702731 [18:51:24] 10SRE, 10DC-Ops: documented procedure for replacing disks in software RAID servers - https://phabricator.wikimedia.org/T220842 (10RobH) a:05RobH→03None [18:51:37] (03PS2) 10Legoktm: mysql_legacy: Re-add x2 and properly support active/active sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) [18:53:06] (03PS3) 10Legoktm: mysql_legacy: Re-add x2 and properly support active/active sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) [18:53:38] (03Abandoned) 10Legoktm: sre.switchdc.mediawiki: Handle x2 specially [cookbooks] - 10https://gerrit.wikimedia.org/r/701475 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [18:54:27] (03CR) 10Razzi: [C: 03+2] Point analytics-hive to an-coord1001.eqiad.wmnet once again [dns] - 10https://gerrit.wikimedia.org/r/702731 (owner: 10Razzi) [18:55:22] !log otto@deploy1002 Finished deploy [analytics/refinery@7dea883] (hadoop-test): Deploying to analytics-test cluster for testing gobblin [analytics/refinery@7dea883] (duration: 05m 19s) [18:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:50] (03PS1) 
10Brennen Bearnes: Consistently normalize Title::mFragment before setting [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702711 (https://phabricator.wikimedia.org/T285951) [18:57:56] (03PS1) 10Andrew Bogott: Added a dummy password for profile::openstack::eqiad1::ldap_user_pass [labs/private] - 10https://gerrit.wikimedia.org/r/702732 [18:58:26] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added a dummy password for profile::openstack::eqiad1::ldap_user_pass [labs/private] - 10https://gerrit.wikimedia.org/r/702732 (owner: 10Andrew Bogott) [18:58:35] (03CR) 10Ppchelko: [C: 03+1] Consistently normalize Title::mFragment before setting [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702711 (https://phabricator.wikimedia.org/T285951) (owner: 10Brennen Bearnes) [18:58:41] (03PS1) 10Bstorm: cloud nfs: commit solidly to the drbd setup step 1 [puppet] - 10https://gerrit.wikimedia.org/r/702733 (https://phabricator.wikimedia.org/T224747) [18:59:01] (03CR) 10jerkins-bot: [V: 04-1] mysql_legacy: Re-add x2 and properly support active/active sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [18:59:07] Pchelolo: i'll go ahead and deploy the above [18:59:15] 10SRE, 10DBA, 10Datacenter-Switchover, 10Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (10Legoktm) Most of the spicerack confusion and trouble is that x2 matches `A:db-core` even though it's more like parsercache. If it didn't match that... [18:59:39] (03CR) 10Brennen Bearnes: [C: 03+2] Consistently normalize Title::mFragment before setting [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702711 (https://phabricator.wikimedia.org/T285951) (owner: 10Brennen Bearnes) [18:59:52] brennen: hopefully it will work [19:00:04] brennen and marxarelli: How many deployers does it take to do MediaWiki train - American Version deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210701T1900). [19:00:06] testable on mwdebug? [19:00:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:00:40] (03CR) 10Legoktm: "Not sure what to do about "mccabe: MC0001 / MysqlLegacy.get_core_dbs is too complex (11)"" [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [19:02:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:04:43] (03PS1) 10Andrew Bogott: Added a dummy password for profile::openstack::eqiad1::ldap_user_pass again [labs/private] - 10https://gerrit.wikimedia.org/r/702735 [19:04:50] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added a dummy password for profile::openstack::eqiad1::ldap_user_pass again [labs/private] - 10https://gerrit.wikimedia.org/r/702735 (owner: 10Andrew Bogott) [19:05:50] if this is reported more broadly, it may be a wmf.12 deployment blocker: https://phabricator.wikimedia.org/T285966 - some pages were displaying with missing styles on group1 wikis [19:06:23] can anyone have a look and try to reproduce? i reproduced it once, but it fixed itself after refreshing the page. 
[19:07:19] RECOVERY - Disk space on releases1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops [19:07:38] (03CR) 10Bstorm: [C: 03+2] "PCC is correct https://puppet-compiler.wmflabs.org/compiler1002/30081/" [puppet] - 10https://gerrit.wikimedia.org/r/702733 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [19:07:54] MatmaRex: hmm - i do see some weirdness on commons [19:08:06] but noting that wmf.12 isn't on group1 at the moment [19:08:27] oh. heh [19:08:45] well, probably not a blocker then. i'm silly for not checking [19:09:12] it did make it to group1 earlier; rolled back around... hrm, 16:28 [19:09:16] (UTC) [19:09:37] (03Merged) 10jenkins-bot: Trigger update-train-versions job at end of wmf-publish pipeline [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/702710 (owner: 10Ahmon Dancy) [19:10:05] (03PS2) 10Andrew Bogott: toolforge: add a profile for installing the disable_tool script [puppet] - 10https://gerrit.wikimedia.org/r/701928 (https://phabricator.wikimedia.org/T170355) [19:14:53] (03PS3) 10Andrew Bogott: toolforge: add a profile for installing the disable_tool script [puppet] - 10https://gerrit.wikimedia.org/r/701928 (https://phabricator.wikimedia.org/T170355) [19:16:42] (03PS1) 10Bstorm: cloud nfs: cleaning up the non-drbd setup [puppet] - 10https://gerrit.wikimedia.org/r/702738 (https://phabricator.wikimedia.org/T224747) [19:16:56] (03PS1) 10Btullis: Grant icinga permissions to btullis [puppet] - 10https://gerrit.wikimedia.org/r/702739 (https://phabricator.wikimedia.org/T285754) [19:17:55] (03PS2) 10Btullis: Grant icinga permissions to btullis [puppet] - 10https://gerrit.wikimedia.org/r/702739 (https://phabricator.wikimedia.org/T285754) [19:18:04] brennen: could it be a cache issue then? 
[19:18:16] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: add a profile for installing the disable_tool script [puppet] - 10https://gerrit.wikimedia.org/r/701928 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [19:18:17] !log brennen@deploy1002 Synchronized php-1.37.0-wmf.12/.pipeline/config.yaml: Backport: [[gerrit:702168|Trigger update-train-versions job at end of wmf-publish pipeline]] (duration: 01m 08s) [19:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:35] 10SRE, 10serviceops, 10Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (10wkandek) Thanks everybody for the feedback on the communications for the DC switchover process. We will spend some time this quarter (Q1) in working... [19:18:39] RhinosF1: i don't know enough to rule that out [19:19:58] (03Merged) 10jenkins-bot: Consistently normalize Title::mFragment before setting [core] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702711 (https://phabricator.wikimedia.org/T285951) (owner: 10Brennen Bearnes) [19:21:18] (that is, yeah, it feels like something caching related, but i'm not sure how that is or isn't interacting with train.) [19:23:45] Jdlrobson: Krinkle ^ [19:24:06] You both have Minerva changes [19:24:42] brennen: my guess is some change causes css to be rendered that is invalid on the previous version [19:24:51] So rolling forward is fine but not back [19:24:58] that seems plausible. [19:25:21] rolling forward ought to provide a test of that, at any rate. [19:25:25] There used to be a file somewhere to invalidate cache from a specific time [19:25:37] But very quick searches do not find [19:25:50] Which patch we are talking about, how far is it deployed? 
[19:26:20] .12 is currently only on group1, seeing broken link styling on https://commons.m.wikimedia.org/wiki/Main_Page [19:26:22] Krinkle: mobile website seems to have styling issues on group 1 wikis post rollback [19:26:25] sorry: only on group0 [19:27:06] is there a task? define broken. It looks ok at a glance [19:27:20] Krinkle: https://phabricator.wikimedia.org/T285966 [19:27:42] The first screenshot there is in VisualEditor [19:27:53] (03CR) 10Ottomata: [C: 03+1] Grant icinga permissions to btullis [puppet] - 10https://gerrit.wikimedia.org/r/702739 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [19:28:51] what error do you see at https://commons.m.wikimedia.org/wiki/Main_Page ? [19:29:06] Krinkle: uploaded a screenshot to the task [19:29:36] OK, it only happens when logged-out it seems [19:29:41] I can repro [19:29:43] Yeah [19:29:49] I can too logged out [19:30:22] I defer to Jdlrobson. It looks like there are missing styles indeed. It is adding a link background image, but not setting any background position or no-repeat rules, so it just repeats [19:30:26] and lots of other styles missing as well [19:30:56] Krinkle: commons ain't on .12 though yet [19:30:58] It was [19:31:04] But rolled back [19:31:12] So cache must be blameable then [19:31:14] the cached copy uses wmf.12 indeed [19:31:24] When I add ?snlala it renders fine logged-out [19:31:36] so I guess this means Minerva made breaking changes to its style modules [19:31:44] Krinkle: do we know how to ban cache [19:31:54] loading a new module that doesn't exist before, with styles that aren't covered by the same module names before [19:32:09] usually that needs to be primed first in a separate deployment 7 days ahead [19:32:36] (03PS1) 10Andrew Bogott: profile::toolforge::disable_tool: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/702745 (https://phabricator.wikimedia.org/T170355) [19:32:37] wmf.12 cache: [19:32:44] Krinkle: there used to be a file to ban a certain time 
[19:32:50] wmf.11 query bypass: [19:32:52] (03CR) 10jerkins-bot: [V: 04-1] profile::toolforge::disable_tool: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/702745 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [19:33:03] RhinosF1: these are not static files, the purge logic for /static as for site logos does not apply here [19:33:05] this is about HTML caches [19:33:08] which are part of parser cache etc. [19:33:21] Ah [19:33:25] the HTML loads a stylesheet, but the problem isn't with the stylesheet itself [19:33:47] OK, so "skins.minerva.content.styles" is missing from the wmf.12 entry [19:33:52] I may be to blame for this. [19:34:02] https://gerrit.wikimedia.org/r/q/0d61c78f [19:34:06] You will be [19:34:14] !log brennen@deploy1002 Synchronized php-1.37.0-wmf.12/includes/Title.php: Backport: [[gerrit:702711|Consistently normalize Title::mFragment before setting (T285951)]] (duration: 01m 10s) [19:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:21] T285951: Some section links in search results are redlinks - https://phabricator.wikimedia.org/T285951 [19:34:34] (03CR) 10Btullis: [C: 03+2] Grant icinga permissions to btullis [puppet] - 10https://gerrit.wikimedia.org/r/702739 (https://phabricator.wikimedia.org/T285754) (owner: 10Btullis) [19:34:59] Yeah, I should have kept skins.minerva.content.styles defined as-is [19:35:19] I made for forward-compat but not back-compat which only surfaces during rollback [19:35:32] depending on whether roll forward happens quicker than me patching it, I can patch it. 
[19:35:52] (03PS2) 10Andrew Bogott: profile::toolforge::disable_tool: standardize on the singular 'disable_tool' [puppet] - 10https://gerrit.wikimedia.org/r/702745 (https://phabricator.wikimedia.org/T170355) [19:35:53] easy enough to keep it defined, then the new stylesheet urls that are in some HTML caches now will automatically start working [19:35:55] Krinkle: other blockers are resolved as soon as a sync finishes here momentarily. [19:35:57] !log brennen@deploy1002 Synchronized php-1.37.0-wmf.12/tests/phpunit/includes/TitleMethodsTest.php: Backport: [[gerrit:702711|Consistently normalize Title::mFragment before setting (T285951)]] (duration: 01m 10s) [19:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:07] brennen: ok, I'll leave it then, but as a lesson for next time. [19:36:26] k, thanks for investigating all. [19:36:41] rolling forward momentarily. [19:37:26] (03PS1) 10Brennen Bearnes: group1 wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702747 [19:37:28] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702747 (owner: 10Brennen Bearnes) [19:37:44] (03CR) 10Andrew Bogott: [C: 03+2] profile::toolforge::disable_tool: standardize on the singular 'disable_tool' [puppet] - 10https://gerrit.wikimedia.org/r/702745 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [19:38:19] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702747 (owner: 10Brennen Bearnes) [19:39:51] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.12 [19:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:04] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.12 (duration: 01m 12s) [19:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:58] 
10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (10Ezarat... [19:42:48] (03PS1) 10Brennen Bearnes: all wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702749 [19:42:50] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702749 (owner: 10Brennen Bearnes) [19:43:31] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702749 (owner: 10Brennen Bearnes) [19:43:37] brennen: here and log watching [19:45:04] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.12 [19:45:05] saw a strange pulse of UserIdentityValue deprecation errors [19:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:14] perhaps transient [19:46:39] Krinkle: looks perfect again now [19:46:59] RhinosF1: details at https://phabricator.wikimedia.org/T266361#7191087 [19:47:18] the wmf.12 code was already backwards compatible, so all caches work again now both new and old. [19:47:33] I forgot to test the rollback scenario, would have been a 1 line fix. 
[19:48:54] Krinkle: no problem, thanks for helping work it out [20:00:36] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:14:40] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:17:13] 10SRE: Please add btullis@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T285936 (10BTullis) 05Open→03Resolved a:03BTullis Many thanks. [20:17:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:21:37] (03CR) 10Andrew Bogott: [C: 03+1] "thanks! Pretty sure I've created that dir by hand a few times" [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [20:22:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:23:17] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/702677 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [20:25:30] 10SRE, 10Performance-Team, 10serviceops, 10MW-1.36-notes, and 3 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10Krinkle) So the problem appears to be bad interactions between WANCache's "pre-emptive regeneration" feature (as prompted by... [20:51:32] I don't know where to ask this but where can I find node12 docker images in WMF's docker-registry? 
catalog (https://docker-registry.wikimedia.org/v2/_catalog) didn't have any node12 images [20:57:23] found it, it seems it's not in production namespace but I could use the releng namespace for my usecase [21:04:14] 10SRE, 10Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (10TheDJ) This is a libpng error (via image magick). Likely these images were always problematic, but the proble... [21:06:44] (03PS2) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [21:09:24] (03PS3) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [21:13:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:15:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:23:22] Amir1: there's no production node12 image since no one has started moving services to bullseye yet [21:23:50] noted [21:23:53] thanks [21:24:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:26:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:27:55] brennen: still around? [21:28:50] RhinosF1: here [21:29:14] brennen: does https://en.wikipedia.org/wiki/Special:EmailUser/RhinosF1 say $1 for you [21:29:20] Can reproduce for all users [21:29:45] say $1 where? [21:29:50] RhinosF1: screenshot please [21:30:15] https://usercontent.irccloud-cdn.com/file/Dy8Dwc7T/1625175010.JPG [21:30:31] brennen: everywhere that should be my username in the notice [21:30:34] legoktm: ^ [21:31:00] I see your name properly substituted in the message [21:31:23] yeah, i can't repro while logged in [21:32:39] happening for other users besides yourself? [21:32:50] (ftr: I can't reproduce either) [21:33:16] thcipriani: by the lack of reports I'd guess no [21:33:23] legoktm: very strange [21:33:43] RhinosF1: what does ?uselang=qqx say? [21:33:58] it should be "(emailpagetext: RhinosF1)" [21:34:41] legoktm: enter the username in https://en.wikipedia.org/wiki/Special:EmailUser [21:34:46] (03PS4) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [21:34:51] I don't get /username on the end [21:34:55] Didn't that used to happen [21:35:21] the URL is off yes, but the top message is still correct [21:35:41] (03PS1) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 [21:36:08] (03CR) 10jerkins-bot: [V: 04-1] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (owner: 10Ryan Kemper) [21:36:15] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+log/refs/heads/master/includes/specials/SpecialEmailUser.php seems unlikely to be a recent regression in any case [21:36:22] legoktm: (emailpagetext: RhinosF1) when I add the /RhinosF1 [21:36:28] Just takes me to enter username 
otherwise [21:36:45] $1 still on /RhinosF1 too [21:37:30] err. to clarify, when visiting https://en.wikipedia.org/wiki/Special:EmailUser/RhinosF1 you see $1, and when visiting https://en.wikipedia.org/wiki/Special:EmailUser/RhinosF1?uselang=qqx you see (emailpagetext: RhinosF1) ? [21:37:38] Yes [21:38:28] and the $1 shows up if you just visit the link? [21:38:53] (03PS1) 10Ahmon Dancy: Temporarily disable notification for security patch failures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702755 [21:39:23] (03CR) 10Ahmon Dancy: [C: 03+2] Temporarily disable notification for security patch failures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702755 (owner: 10Ahmon Dancy) [21:40:05] (03Merged) 10jenkins-bot: Temporarily disable notification for security patch failures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702755 (owner: 10Ahmon Dancy) [21:40:24] jouncebot help [21:40:24] **** JounceBot Help **** [21:40:24] JounceBot is a deployment helper bot for the Wikimedia Foundation. [21:40:24] You can find my source at https://github.com/mattofak/jouncebot [21:40:24] Available commands: [21:40:24] HELP Print all commands known to the server. [21:40:25] NEXT Get the next deployment event(s if they happen at the same time). [21:40:25] NOW Get the current deployment event(s) or the time until the next. [21:40:25] legoktm: ye [21:40:25] REFRESH Refresh my knowledge about deployments. [21:40:32] jouncebot hnow [21:40:34] jouncebot now [21:40:35] No deployments scheduled for the next 1 hour(s) and 19 minute(s) [21:40:41] Excellent. [21:41:46] RhinosF1: I'm pretty stumped. file a bug? 
[21:42:32] it doesn't seem blocker worthy unless other people experience it too but still weird [21:43:08] !log dancy@deploy1002 Synchronized .pipeline/config.yaml: Config: [[gerrit:702755|Temporarily disable notification for security patch failures]] (duration: 00m 57s) [21:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:28] https://phabricator.wikimedia.org/T285985 [21:44:59] * bd808 sees the outdated bits of `jouncebot help` and winces [21:47:31] (03PS5) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [21:51:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10wiki_willy) Just got off the phone with Dell. It's escalated on their side, and they're going to sync up tomorrow in figuring out a solution for this, which could very w... 
[21:53:45] (03PS1) 10BryanDavis: Update `help` message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/702758 [21:55:09] (03CR) 10Legoktm: [C: 03+2] Update `help` message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/702758 (owner: 10BryanDavis) [21:55:39] (03Merged) 10jenkins-bot: Update `help` message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/702758 (owner: 10BryanDavis) [21:58:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:58:50] (03PS6) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [22:00:26] (03PS7) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [22:00:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:05:24] (03CR) 10Bstorm: [C: 03+2] d/changelog: Prepare for 0.75 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm) [22:06:52] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.75 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/700095 (owner: 10Bstorm) [22:11:23] (03PS8) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [22:12:52] (03PS9) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [22:14:18] (03PS10) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [22:15:19] (03PS11) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [22:17:56] (03PS12) 10Ahmon Dancy: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) [22:24:12] jouncebot: help [22:24:12] **** JounceBot Help **** [22:24:12] JounceBot is a deployment helper bot for the Wikimedia movement. [22:24:12] Source at: https://gerrit.wikimedia.org/g/wikimedia/bots/jouncebot [22:24:13] Available commands: [22:24:13] HELP Print all commands known to the server. [22:24:13] NEXT Get the next deployment event(s if they happen at the same time). [22:24:13] NOW Get the current deployment event(s) or the time until the next. [22:24:14] REFRESH Refresh my knowledge about deployments. 
[22:24:23] that looks a bit better :) [22:24:28] Nice work [22:27:06] jouncebot now [22:27:06] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [22:27:26] (03PS1) 10Zabe: Add 'editautoreviewprotected' protection level to hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702761 (https://phabricator.wikimedia.org/T275076) [22:27:35] !log Start server-side upload for 1 video file (T285682) [22:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:44] T285682: Server side upload for Victorgrigas - https://phabricator.wikimedia.org/T285682 [22:28:07] (03CR) 10Ahmon Dancy: [C: 03+2] "Tested at https://releases-jenkins.wikimedia.org/job/mediawiki-config-pipeline-wmf-publish/197/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) (owner: 10Ahmon Dancy) [22:29:30] (03Merged) 10jenkins-bot: Use train-versions.json to map from version to image tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702704 (https://phabricator.wikimedia.org/T282824) (owner: 10Ahmon Dancy) [22:31:12] !log dancy@deploy1002 Synchronized .pipeline: Config: [[gerrit:702704|Use train-versions.json to map from version to image tag (T282824)]] (duration: 00m 57s) [22:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:20] T282824: MW container image build workflow vs docker-registry caching - https://phabricator.wikimedia.org/T282824 [22:33:02] 10SRE, 10Wikimedia-Mailing-lists: Redirect https://lists.wikimedia.org/pipermail/foobar/ to https://lists.wikimedia.org/hyperkitty/list/foobar@lists.wikimedia.org/ - https://phabricator.wikimedia.org/T285949 (10Legoktm) a:03Legoktm [22:36:00] !log Start server-side upload for 1 video file (T285789) [22:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:08] T285789: Server side upload for 고려 - https://phabricator.wikimedia.org/T285789 [22:37:35] !log Start 
server-side upload for 1 video file (T285182) [22:37:41] (03PS3) 10Cwhite: logstash: add ECS transition support for Oslo structured logs [puppet] - 10https://gerrit.wikimedia.org/r/695563 (https://phabricator.wikimedia.org/T234565) [22:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:43] T285182: Server side upload for PantheraLeo1359531 - https://phabricator.wikimedia.org/T285182 [22:38:15] (03PS2) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 [22:38:19] (03PS1) 10Legoktm: mailman3: Redirect pipermail list archive index to hyperkitty [puppet] - 10https://gerrit.wikimedia.org/r/702767 (https://phabricator.wikimedia.org/T285949) [22:39:00] (03CR) 10jerkins-bot: [V: 04-1] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (owner: 10Ryan Kemper) [22:39:29] (03PS3) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) [22:40:13] (03CR) 10jerkins-bot: [V: 04-1] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [22:41:02] (03CR) 10Legoktm: [C: 03+2] mailman3: Redirect pipermail list archive index to hyperkitty [puppet] - 10https://gerrit.wikimedia.org/r/702767 (https://phabricator.wikimedia.org/T285949) (owner: 10Legoktm) [22:41:08] (03CR) 10Ryan Kemper: "PCC looks good. 
Thanks for all your work on this @Muehlenhoff" [puppet] - 10https://gerrit.wikimedia.org/r/702580 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [22:41:12] (03CR) 10Ryan Kemper: [C: 03+2] elastic: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/702580 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [22:47:01] (03PS4) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) [22:47:13] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Redirect https://lists.wikimedia.org/pipermail/foobar/ to https://lists.wikimedia.org/hyperkitty/list/foobar@lists.wikimedia.org/ - https://phabricator.wikimedia.org/T285949 (10Legoktm) ` km@cashew ~> curl -I "https://lists.wikimedia.org/pipermail/xtools/... [22:47:38] (03CR) 10jerkins-bot: [V: 04-1] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [22:50:52] (03PS5) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) [22:51:20] (03CR) 10jerkins-bot: [V: 04-1] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [22:55:20] (03CR) 10Cwhite: [C: 03+2] logstash: add ECS transition support for Oslo structured logs [puppet] - 10https://gerrit.wikimedia.org/r/695563 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:56:13] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Addshore) Any idea on a timeline for being able to get this ticket moving? It's blocking T176312 which... 
[23:00:05] brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for US Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210701T2300). [23:00:05] zabe: A patch you scheduled for US Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:16] o/ [23:05:48] hey zabe I was reading the back-and-forth on the task and I can't quite tell what's going on: would it be OK to move this patch to a later window after folks have had some time to review? [23:06:45] ok, sounds fair [23:07:29] (03PS6) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) [23:07:58] (03CR) 10jerkins-bot: [V: 04-1] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [23:08:15] (03PS1) 10Thcipriani: deployment training: readme whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702774 [23:08:46] thanks for understanding zabe <3 [23:09:31] (03CR) 10Thcipriani: [C: 03+2] deployment training: readme whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702774 (owner: 10Thcipriani) [23:09:54] (03PS7) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) [23:10:14] (03Merged) 10jenkins-bot: deployment training: readme whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702774 (owner: 10Thcipriani) [23:10:29] (03CR) 10jerkins-bot: [V: 04-1] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [23:14:15] (03PS8) 10Ryan Kemper: cirrus: systemd timer for readahead 
script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) [23:15:02] (03CR) 10jerkins-bot: [V: 04-1] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [23:17:16] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:18:32] (03PS9) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) [23:19:54] (03CR) 10jerkins-bot: [V: 04-1] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [23:20:50] (03PS10) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) [23:21:38] !log thcipriani@deploy1002 Synchronized README: Config: [[gerrit:702774|deployment training: readme whitespace]] (duration: 00m 57s) [23:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:14] (03CR) 10jerkins-bot: [V: 04-1] cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [23:22:29] (03CR) 10Ryan Kemper: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [23:25:47] (03PS1) 10Thcipriani: Revert "deployment training: readme whitespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702777 [23:27:05] (03CR) 10Thcipriani: [C: 03+2] Revert "deployment training: readme whitespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702777 (owner: 10Thcipriani) [23:27:44] (03Merged) 10jenkins-bot: Revert "deployment 
training: readme whitespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702777 (owner: 10Thcipriani) [23:29:47] !log thcipriani@deploy1002 Synchronized README: Config: [[gerrit:702777|Revert "deployment training: readme whitespace"]] (duration: 00m 56s) [23:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:47] (03CR) 10Cwhite: [C: 03+1] prometheus: don't deploy alerts to 'global' instance by default [puppet] - 10https://gerrit.wikimedia.org/r/702599 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [23:37:19] (03PS11) 10Ryan Kemper: cirrus: systemd timer for readahead script [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) [23:38:38] PROBLEM - NFS Share Volume Space /srv/scratch on cloudstore1008 is CRITICAL: DISK CRITICAL - free space: /srv/scratch 595580 MB (15% inode=99%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [23:40:50] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper) [23:57:33] (03CR) 10Ebernhardson: [C: 03+1] "Seems overall reasonable. Wish we had a better place for the binary, but i think this will do." [puppet] - 10https://gerrit.wikimedia.org/r/702754 (https://phabricator.wikimedia.org/T264053) (owner: 10Ryan Kemper)