[00:00:21] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:13:20] (03PS3) 10Ryan Kemper: wcqs: state change: monitoring_setup -> production [puppet] - 10https://gerrit.wikimedia.org/r/724536 (https://phabricator.wikimedia.org/T280001) [00:13:42] (03CR) 10Ryan Kemper: [V: 03+2] "only changed commit message" [puppet] - 10https://gerrit.wikimedia.org/r/724536 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [00:13:45] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wcqs: state change: monitoring_setup -> production [puppet] - 10https://gerrit.wikimedia.org/r/724536 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [00:14:34] !log T280001 Moving wcqs state from `monitoring_setup` to `production`; merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/724536 [00:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:42] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [00:15:05] !log T280001 `ryankemper@cumin1001:~$ sudo cumin 'A:icinga or A:dns-auth' run-puppet-agent` per https://wikitech.wikimedia.org/wiki/LVS#Make_the_service_page,_add_discovery_resources [00:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:10] (03CR) 10Ryan Kemper: [C: 03+2] wcqs: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/724538 (https://phabricator.wikimedia.org/T282117) (owner: 10Ryan Kemper) [00:19:32] !log T280001 Okay now we're clear to proceed to https://wikitech.wikimedia.org/wiki/LVS#For_active/active_services; merging https://gerrit.wikimedia.org/r/c/operations/dns/+/724538 [00:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:25] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with 
command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:21:19] !log T280001 `ryankemper@authdns1001:~$ sudo -i authdns-update` following merge of https://gerrit.wikimedia.org/r/c/operations/dns/+/724538 [00:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:26] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [00:21:59] (03PS1) 10Legoktm: Have SyntaxHighlight use Shellbox on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724544 [00:23:35] (03PS1) 10Ryan Kemper: wcqs: add disc desired state [puppet] - 10https://gerrit.wikimedia.org/r/724545 (https://phabricator.wikimedia.org/T280001) [00:24:55] (03CR) 10Legoktm: [C: 03+2] Have SyntaxHighlight use Shellbox on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724544 (owner: 10Legoktm) [00:25:11] (03PS2) 10Ryan Kemper: wcqs: add disc desired state [puppet] - 10https://gerrit.wikimedia.org/r/724545 (https://phabricator.wikimedia.org/T280001) [00:25:21] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724545 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [00:25:47] (03Merged) 10jenkins-bot: Have SyntaxHighlight use Shellbox on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724544 (owner: 10Legoktm) [00:27:36] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Have SyntaxHighlight use Shellbox on all wikis (duration: 01m 18s) [00:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:01] (03CR) 10Ryan Kemper: [C: 03+2] wcqs: add disc desired state [puppet] - 10https://gerrit.wikimedia.org/r/724545 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [00:45:54] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process 
with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:57:02] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) @jijiki @Dzahn this is all ready for service. Thank you. [01:24:12] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10Papaul) I looked at this server today and did some power drain as well; it looks like a main board issue to me on this server. @wiki_willy the server is out of warranty since 02/2021 [01:29:40] 10SRE, 10Sustainability (Incident Followup): A puppet run should not start if a box is under abnormal load. - https://phabricator.wikimedia.org/T84183 (10Krinkle) [01:30:33] 10SRE, 10DBA: Migrate parsercache away from being a full RDBMS - https://phabricator.wikimedia.org/T84187 (10Krinkle) [01:30:50] 10SRE, 10DBA: Migrate parsercache away from being a full RDBMS - https://phabricator.wikimedia.org/T84187 (10Krinkle) [01:35:08] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10Papaul) Create a case with Dell. Case below: case#: 123351649 Tag: R740XD: Crashes | ProSupport: NBD | - Linux - [01:36:03] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw: Ship back Raritan test PDU - https://phabricator.wikimedia.org/T287762 (10Papaul) 05Open→03Resolved This is complete [01:52:10] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10Papaul) I see 1 error here. Bios is on 1.5.4, New BIOS is 2.12.2 : https://dl.dell.com/FOLDER07551855M/4/BIOS_4CRD2_WN64_2.12.2.EXE Idrac is on 3.21.21.21 Current is 5.00.10.00 Raid... 
[02:04:43] 10SRE-Access-Requests: Requesting access to Superset for gehel - https://phabricator.wikimedia.org/T292040 (10cchen) [02:06:11] 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset for gehel - https://phabricator.wikimedia.org/T292040 (10cchen) [02:10:54] 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset for gehel - https://phabricator.wikimedia.org/T292040 (10cchen) [02:28:59] (03PS1) 10Legoktm: Have PdfHandler/PagedTiffHandler use Shellbox on all wikis but Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724572 [02:33:19] (03CR) 10Legoktm: [C: 03+2] Have PdfHandler/PagedTiffHandler use Shellbox on all wikis but Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724572 (owner: 10Legoktm) [02:34:07] (03Merged) 10jenkins-bot: Have PdfHandler/PagedTiffHandler use Shellbox on all wikis but Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724572 (owner: 10Legoktm) [02:36:34] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Have PdfHandler/PagedTiffHandler use Shellbox on all wikis but Commons (duration: 01m 07s) [02:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:18] (03PS1) 10Legoktm: Have PdfHandler use Shellbox on 10% of requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724576 (https://phabricator.wikimedia.org/T289228) [02:46:20] (03PS1) 10Legoktm: Have PagedTiffHandler use Shellbox on 10% of requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724577 (https://phabricator.wikimedia.org/T289228) [02:50:10] (03PS2) 10Legoktm: Have PdfHandler use Shellbox on Commons for 10% of requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724576 (https://phabricator.wikimedia.org/T289228) [02:50:12] (03PS2) 10Legoktm: Have PagedTiffHandler use Shellbox on Commons for 10% of requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724577 (https://phabricator.wikimedia.org/T289228) 
[04:24:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:29:40] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:38:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:39:38] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:50:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2081 T290868', diff saved to https://phabricator.wikimedia.org/P17342 and previous config saved to /var/cache/conftool/dbconfig/20210929-045033-marostegui.json [04:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:43] T290868: Upgrade s8 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290868 [04:51:16] (03PS1) 10Marostegui: db2081: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/724584 (https://phabricator.wikimedia.org/T290868) [04:53:29] (03CR) 10Marostegui: [C: 03+2] db2081: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/724584 (https://phabricator.wikimedia.org/T290868) (owner: 10Marostegui) [04:54:16] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:55:38] PROBLEM - Router interfaces on cr1-codfw is 
CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:56:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:56:40] 10SRE, 10Patch-For-Review, 10Sustainability (Incident Followup): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560 (10Joe) 05Open→03Resolved [04:57:38] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:03:38] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:04:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:04:17] (03PS1) 10Marostegui: clouddb1020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/724586 (https://phabricator.wikimedia.org/T291963) [05:04:59] (03CR) 10Marostegui: "As this host is going to go under maintenance - I am disabling its notifications for now." 
[puppet] - 10https://gerrit.wikimedia.org/r/724586 (https://phabricator.wikimedia.org/T291963) (owner: 10Marostegui) [05:05:04] (03CR) 10Marostegui: [C: 03+2] clouddb1020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/724586 (https://phabricator.wikimedia.org/T291963) (owner: 10Marostegui) [05:13:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:14:08] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:16:15] (03PS1) 10Marostegui: filtered_tables.txt: Remove references to flaggedimages [puppet] - 10https://gerrit.wikimedia.org/r/724589 (https://phabricator.wikimedia.org/T290340) [05:24:25] (03CR) 10Marostegui: [C: 03+2] filtered_tables.txt: Remove references to flaggedimages [puppet] - 10https://gerrit.wikimedia.org/r/724589 (https://phabricator.wikimedia.org/T290340) (owner: 10Marostegui) [05:25:55] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset for gehel - https://phabricator.wikimedia.org/T292040 (10Joe) Given @Gehel already has full access to all of production, this should only need a signoff / acknowledgement from @odimitrijevic or @Ottomata. 
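Editor's note: the PROBLEM/RECOVERY pairs above for cr1-eqiad and cr1-codfw can be matched up to see how long each flap lasted. The following is a minimal sketch in Python, assuming the one-line alert format seen in this log. The regular expression and the function name `flap_durations` are illustrative and are not part of any Wikimedia tooling. Timestamps carry no date, so a log that crosses midnight is not handled.

```python
import re
from datetime import datetime, timedelta

# Assumed alert layout, taken from the lines in this log:
# "[HH:MM:SS] PROBLEM - <check> on <host> is <STATE>: ..."
ALERT_RE = re.compile(
    r"\[(?P<ts>\d{2}:\d{2}:\d{2})\] (?P<kind>PROBLEM|RECOVERY) - "
    r"(?P<check>.+?) on (?P<host>\S+) is (?P<state>\w+):"
)


def flap_durations(lines):
    """Pair each PROBLEM with the next RECOVERY for the same (check, host)."""
    open_since = {}   # (check, host) -> time of the first unresolved PROBLEM
    durations = []    # list of (host, timedelta), in recovery order
    for line in lines:
        m = ALERT_RE.search(line)
        if not m:
            continue  # not an alert line (Gerrit, Phabricator, chat, ...)
        key = (m["check"], m["host"])
        ts = datetime.strptime(m["ts"], "%H:%M:%S")
        if m["kind"] == "PROBLEM":
            # Keep the earliest PROBLEM if the alert repeats before recovering.
            open_since.setdefault(key, ts)
        elif key in open_since:
            durations.append((key[1], ts - open_since.pop(key)))
    return durations


sample = [
    "[04:24:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1",
    "[04:29:40] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1",
    "[04:38:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0",
    "[04:39:38] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0",
]

for host, delta in flap_durations(sample):
    print(host, delta)
```

A RECOVERY with no earlier PROBLEM in the input is skipped rather than guessed at, which keeps the sketch safe to run on a log excerpt that starts mid-incident.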
[05:26:07] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset for gehel - https://phabricator.wikimedia.org/T292040 (10Joe) p:05Triage→03Medium a:03Joe [05:37:30] (03PS1) 10Giuseppe Lavagetto: admin: add saisuman to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/724591 (https://phabricator.wikimedia.org/T291948) [05:37:32] (03PS1) 10Giuseppe Lavagetto: admin: add gehel to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/724592 (https://phabricator.wikimedia.org/T292040) [05:42:08] (03CR) 10Ryan Kemper: [C: 03+1] "LGTM! The comments are very clear. Will deploy this weds (haven't merged anything from operations/alerts before so not sure if it's just a" [alerts] - 10https://gerrit.wikimedia.org/r/724423 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [05:48:39] ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2 Marostegui https://phabricator.wikimedia.org/T291963 https://wikitech.wikimedia.org/wiki/HAProxy [05:53:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:53:40] (03PS2) 10Ryan Kemper: query service: Fix loading of DCAT-AP dataset [puppet] - 10https://gerrit.wikimedia.org/r/720746 (https://phabricator.wikimedia.org/T289517) (owner: 10DCausse) [05:53:55] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:55:57] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:56:47] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2103 T290865', diff saved to https://phabricator.wikimedia.org/P17344 and previous config saved to /var/cache/conftool/dbconfig/20210929-055645-marostegui.json [05:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:54] T290865: Upgrade s1 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290865 [05:57:55] (03PS1) 10Marostegui: Revert "db2103: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/724550 [05:58:20] (03PS6) 10Ryan Kemper: search-platform: Alert when blazegraph burns allocator too rapidly [alerts] - 10https://gerrit.wikimedia.org/r/720684 (https://phabricator.wikimedia.org/T284446) (owner: 10DCausse) [05:58:30] (03CR) 10Marostegui: [C: 03+2] Revert "db2103: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/724550 (owner: 10Marostegui) [06:00:52] (03PS7) 10Ryan Kemper: search-platform: Alert when blazegraph burns allocator too rapidly [alerts] - 10https://gerrit.wikimedia.org/r/720684 (https://phabricator.wikimedia.org/T284446) (owner: 10DCausse) [06:02:03] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:03:39] (03CR) 10Ryan Kemper: [C: 03+2] query service: Fix loading of DCAT-AP dataset [puppet] - 10https://gerrit.wikimedia.org/r/720746 (https://phabricator.wikimedia.org/T289517) (owner: 10DCausse) [06:09:49] !log T289517 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/720746 (fix dcat-ap loading) [06:09:51] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:55] T289517: 
DCAT AP endpoint is down - https://phabricator.wikimedia.org/T289517 [06:10:01] !log T289517 Ran puppet across query_service fleet `sudo cumin -b 6 'P{w*qs*}' 'sudo run-puppet-agent'` [06:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:15] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:10:46] (03CR) 10Ryan Kemper: [C: 03+2] search-platform: Alert when blazegraph burns allocator too rapidly [alerts] - 10https://gerrit.wikimedia.org/r/720684 (https://phabricator.wikimedia.org/T284446) (owner: 10DCausse) [06:12:45] (03Merged) 10jenkins-bot: search-platform: Alert when blazegraph burns allocator too rapidly [alerts] - 10https://gerrit.wikimedia.org/r/720684 (https://phabricator.wikimedia.org/T284446) (owner: 10DCausse) [06:13:44] (03PS4) 10Ryan Kemper: search-platform: Fix flink app crashloop detection [alerts] - 10https://gerrit.wikimedia.org/r/724423 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [06:15:12] !log Deploy schema change on s8 codfw (lag will show up) T283499 [06:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:20] T283499: Schema change for renaming page_timestamp index on revision table to rev_page_timestamp - https://phabricator.wikimedia.org/T283499 [06:15:59] (03CR) 10Ryan Kemper: [C: 03+2] search-platform: Fix flink app crashloop detection [alerts] - 10https://gerrit.wikimedia.org/r/724423 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [06:18:02] (03Merged) 10jenkins-bot: search-platform: Fix flink app crashloop detection [alerts] - 10https://gerrit.wikimedia.org/r/724423 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [06:22:58] 10SRE, 10SRE-swift-storage, 10ops-codfw: swift - ms-be2035 - device sdi:6 unavailable - 
https://phabricator.wikimedia.org/T291896 (10Joe) p:05Triage→03High [06:32:21] 10SRE, 10SRE-swift-storage, 10ops-codfw: swift - ms-be2036 - device sdg:4 unavailable - https://phabricator.wikimedia.org/T291988 (10Joe) p:05Triage→03High For the record, `sdg` has many bad sectors (according to kern.log) and should probably be substituted. Not sure why the alert only fired yesterday th... [06:34:53] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:29] (03CR) 10Marostegui: [C: 03+1] "uuid and mail are correct" [puppet] - 10https://gerrit.wikimedia.org/r/724591 (https://phabricator.wikimedia.org/T291948) (owner: 10Giuseppe Lavagetto) [06:39:09] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:03] 10SRE, 10MW-on-K8s, 10serviceops: Repartition mediawiki servers - https://phabricator.wikimedia.org/T291918 (10Joe) p:05Triage→03High I think the title is misleading, I spent 10 minutes trying to figure out what partitioning schemes had to do with moving to kubernetes :D Amending it. [06:43:35] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. 
- https://phabricator.wikimedia.org/T291918 (10Joe) [06:57:15] (03CR) 10DCausse: [C: 03+1] "thanks for updating the image, I'll deploy this today" [deployment-charts] - 10https://gerrit.wikimedia.org/r/724464 (owner: 10PipelineBot) [06:59:07] (03Abandoned) 10DCausse: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/705695 (owner: 10PipelineBot) [07:00:39] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:02:45] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:04:40] <_joe_> ug [07:05:35] <_joe_> volans: I see you just logged into the server, found anything? [07:06:06] _joe_: I saw the problem and tried to ping+ssh, both worked for me [07:06:06] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10wiki_willy) Hi @Papaul - do you have any decom'd servers around to replace this one? If not, we can either see if Service Ops would ok decommissioning it a couple years before its next refresh, o... [07:06:10] and then the recovery came [07:06:24] was not a reboot, that I can tell :) [07:07:58] <_joe_> yeah I can't see anything worth notice in syslog atm [07:09:52] (03CR) 10Hashar: [C: 03+1] fix: scap: remove confusing logstash dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/724515 (https://phabricator.wikimedia.org/T291870) (owner: 10Thcipriani) [07:10:05] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:01] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[07:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:34] (03PS1) 10Volans: sre.experimental.reimage: fix Phabricator messages [cookbooks] - 10https://gerrit.wikimedia.org/r/724687 [07:17:44] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: better check of OS [cookbooks] - 10https://gerrit.wikimedia.org/r/724461 (owner: 10Volans) [07:18:10] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) The first scenario I proposed in T290536 goes as follows: * One cluster for first deploy/debug purposes (kube-mwdebug) * One cluster to serve internal requests to t... [07:20:07] (03Merged) 10jenkins-bot: sre.experimental.reimage: better check of OS [cookbooks] - 10https://gerrit.wikimedia.org/r/724461 (owner: 10Volans) [07:21:00] (03PS2) 10Volans: sre.experimental.reimage: fix Phabricator messages [cookbooks] - 10https://gerrit.wikimedia.org/r/724687 [07:23:50] (03PS3) 10Muehlenhoff: microsites: Switch to wmflib::dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/724053 [07:23:51] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [07:25:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2081 T290868', diff saved to https://phabricator.wikimedia.org/P17345 and previous config saved to /var/cache/conftool/dbconfig/20210929-072520-marostegui.json [07:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:28] T290868: Upgrade s8 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290868 [07:25:40] (03CR) 10Muehlenhoff: [C: 03+2] microsites: Switch to wmflib::dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/724053 (owner: 10Muehlenhoff) [07:25:43] (03PS1) 10Filippo Giunchedi: Add 
deploying section to README [alerts] - 10https://gerrit.wikimedia.org/r/724688 [07:27:12] (03CR) 10Muehlenhoff: [C: 03+2] Create landing page for invidual OS overviews [puppet] - 10https://gerrit.wikimedia.org/r/724470 (owner: 10Muehlenhoff) [07:28:39] (03CR) 10Filippo Giunchedi: [C: 03+2] Add deploying section to README [alerts] - 10https://gerrit.wikimedia.org/r/724688 (owner: 10Filippo Giunchedi) [07:28:44] (03PS2) 10Filippo Giunchedi: Add deploying section to README [alerts] - 10https://gerrit.wikimedia.org/r/724688 [07:28:56] (03PS13) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [07:29:07] (03CR) 10David Caro: [C: 03+2] ldap::sssd: remove unused parameter ldapincludes [puppet] - 10https://gerrit.wikimedia.org/r/724004 (owner: 10David Caro) [07:29:20] (03PS3) 10David Caro: ldap::sssd: remove unused parameter ldapincludes [puppet] - 10https://gerrit.wikimedia.org/r/724004 [07:29:48] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Add deploying section to README [alerts] - 10https://gerrit.wikimedia.org/r/724688 (owner: 10Filippo Giunchedi) [07:34:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add saisuman to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/724591 (https://phabricator.wikimedia.org/T291948) (owner: 10Giuseppe Lavagetto) [07:35:25] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [07:35:59] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [07:37:46] (03PS14) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [07:38:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T291948 (10Joe) 05Open→03Resolved @SCherukuwada within the next hour you should be able to access superset. I'm tentatively resolving the task, please re... [07:39:44] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10dcaro) [07:43:24] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [07:51:04] !log fail sdg on be2036 - T291988 [07:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:11] T291988: swift - ms-be2036 - device sdg:4 unavailable - https://phabricator.wikimedia.org/T291988 [07:54:47] 10SRE, 10SRE-swift-storage, 10ops-codfw: swift - ms-be2036 - device sdg:4 unavailable - https://phabricator.wikimedia.org/T291988 (10fgiunchedi) @papaul please replace the 4TB drive, should be blinking, thank you! 
[07:55:14] (03PS3) 10Volans: sre.experimental.reimage: fix Phabricator messages [cookbooks] - 10https://gerrit.wikimedia.org/r/724687 [07:55:16] (03PS1) 10Volans: sre.hosts.downtime: poll Icinga status [cookbooks] - 10https://gerrit.wikimedia.org/r/724691 [07:57:16] (03CR) 10Volans: "A bit of context is in:" [cookbooks] - 10https://gerrit.wikimedia.org/r/724691 (owner: 10Volans) [07:58:23] (03PS2) 10WMDE-Fisch: Enable line numbering on all namespaces (pilot wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722279 (https://phabricator.wikimedia.org/T280027) (owner: 10Awight) [07:58:53] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Volans) >>! In T290190#7386065, @Papaul wrote: > @Volans I was able to get thumbor2005 installed without adding the MAC address but the install failed a... [08:03:24] (03Abandoned) 10David Caro: base: Add test and fix notification condition [puppet] - 10https://gerrit.wikimedia.org/r/724074 (owner: 10David Caro) [08:04:03] (03CR) 10Ema: [C: 03+2] rsyslog: abort on unclean config [puppet] - 10https://gerrit.wikimedia.org/r/720921 (https://phabricator.wikimedia.org/T290870) (owner: 10Ema) [08:05:52] (03PS1) 10Muehlenhoff: Generate separate distro status page [puppet] - 10https://gerrit.wikimedia.org/r/724692 [08:06:53] PROBLEM - Check systemd state on ms-be2036 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] fix: scap: remove confusing logstash dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/724515 (https://phabricator.wikimedia.org/T291870) (owner: 10Thcipriani) [08:08:15] (03CR) 10Muehlenhoff: [C: 03+2] Generate separate distro status page [puppet] - 10https://gerrit.wikimedia.org/r/724692 (owner: 10Muehlenhoff) [08:10:21] PROBLEM - Check systemd state on 
elastic1035 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:23] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:53] PROBLEM - Check systemd state on aqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:02] the above are likely related to https://gerrit.wikimedia.org/r/720921 - looking [08:11:09] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:09] PROBLEM - Check systemd state on restbase2020 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:13] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:13] PROBLEM - Check systemd state on elastic1056 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:19] PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:29] PROBLEM - Check systemd state on elastic2050 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:33] PROBLEM - 
Check systemd state on ldap-replica2005 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:51] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.02222 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:11:53] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:26] ack [08:12:39] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:41] PROBLEM - Check systemd state on rdb2010 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:43] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:43] godog: looks like it may be due to the hosts running stretch? 
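Editor's note: when many "Check systemd state" alerts arrive at once, as above, grouping hosts by the set of failed units makes it easier to tell one widespread cause (here `rsyslog.service,syslog.socket`) from unrelated failures such as `swift-drive-audit.service` on ms-be2036. The following is a minimal Python sketch that assumes the alert wording seen in this log. The name `hosts_by_failed_units` is hypothetical and does not refer to an existing tool.

```python
import re
from collections import defaultdict

# Assumed wording of the systemd-state alert, copied from this log.
UNITS_RE = re.compile(
    r"PROBLEM - Check systemd state on (?P<host>\S+) is CRITICAL: "
    r"CRITICAL - degraded: The following units failed: (?P<units>\S+)"
)


def hosts_by_failed_units(lines):
    """Map each sorted tuple of failed units to the hosts that reported it."""
    groups = defaultdict(list)
    for line in lines:
        m = UNITS_RE.search(line)
        if m:
            units = tuple(sorted(m["units"].split(",")))
            groups[units].append(m["host"])
    return dict(groups)


sample = [
    "[08:06:53] PROBLEM - Check systemd state on ms-be2036 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state",
    "[08:10:21] PROBLEM - Check systemd state on elastic1035 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state",
    "[08:10:53] PROBLEM - Check systemd state on aqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state",
]

for units, hosts in hosts_by_failed_units(sample).items():
    print(",".join(units), "->", " ".join(hosts))
```

Sorting the unit names before grouping means that the same pair of units reported in a different order still lands in one group.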
[08:12:53] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:57] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:59] ema: agreed, it does seem like that [08:12:59] PROBLEM - Check systemd state on ores2003 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:05] PROBLEM - Check systemd state on aqs1014 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:13] PROBLEM - Check systemd state on restbase1026 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:13] PROBLEM - Check systemd state on restbase2015 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:19] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:25] PROBLEM - Check systemd state on elastic1032 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:26] safest is to rollback IMHO for now [08:13:29] godog: I'll revert for now, yeah [08:13:31] PROBLEM - HP RAID on ms-be2036 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 
2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:13:34] ACKNOWLEDGEMENT - HP RAID on ms-be2036 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T292046 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:13:37] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:37] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:38] SRE, ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T292046 (ops-monitoring-bot) [08:13:45] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:58] (PS1) Ema: Revert "rsyslog: abort on unclean config" [puppet] - https://gerrit.wikimedia.org/r/724551 [08:14:01] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:02] SRE, ops-codfw: Degraded RAID on ms-be2036 - https://phabricator.wikimedia.org/T292046 (fgiunchedi) [08:14:11] PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The following units failed: 
rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:15] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/724551 [08:14:21] PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:22] (CR) Filippo Giunchedi: [C: +1] Revert "rsyslog: abort on unclean config" [puppet] - https://gerrit.wikimedia.org/r/724551 (owner: Ema) [08:14:29] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:34] ema: LGTM! thank you [08:14:35] PROBLEM - Check systemd state on restbase1021 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:37] PROBLEM - Device not healthy -SMART- on ms-be2036 is CRITICAL: cluster=swift device=None instance=ms-be2036 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops [08:14:39] PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:41] SRE, SRE-swift-storage, ops-codfw: swift - ms-be2036 - device sdg:4 unavailable - https://phabricator.wikimedia.org/T291988 (fgiunchedi) [08:14:55] PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:59] PROBLEM - Check systemd state on elastic1042 is CRITICAL: CRITICAL - degraded: The following 
units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:05] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:11] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:21] PROBLEM - Check systemd state on restbase1029 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:21] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:25] PROBLEM - Check systemd state on ldap-replica1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:29] PROBLEM - Check systemd state on ores2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:35] PROBLEM - Check systemd state on elastic1034 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:39] PROBLEM - Check systemd state on aqs1013 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:45] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:50] (CR) Ema: [V: +2 C: +2] Revert "rsyslog: abort on unclean config" [puppet] - https://gerrit.wikimedia.org/r/724551 (owner: Ema) [08:15:53] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:55] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:01] PROBLEM - Check systemd state on ores2005 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:11] PROBLEM - Check systemd state on restbase1022 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:21] PROBLEM - Check systemd state on people2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:23] PROBLEM - Check systemd state on restbase2023 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:29] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:35] PROBLEM - Check systemd state on sessionstore2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:35] PROBLEM - Check systemd state on elastic2034 is 
CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:41] PROBLEM - Check systemd state on people1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:45] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:57] PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:07] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:09] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:15] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:17] PROBLEM - Check systemd state on elastic1065 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:25] PROBLEM - Check systemd state on ores1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:31] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: 
rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:51] ACKNOWLEDGEMENT - snapshot of s4 in codfw on alert1001 is CRITICAL: Last snapshot for s4 at codfw (db2139.codfw.wmnet:3314) taken on 2021-09-28 21:19:24 is 1531 GB, but previous one was 1803 GB, a change of 15.1% Jcrespo ongoing table optimizations - The acknowledgement expires at: 2021-09-30 08:17:18. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:17:57] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:57] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:59] PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:59] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:01] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable line numbering on all namespaces (pilot wikis) [mediawiki-config] - https://gerrit.wikimedia.org/r/722279 (https://phabricator.wikimedia.org/T280027) (owner: Awight) [08:18:07] PROBLEM - Check systemd state on ores1006 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:09] PROBLEM - Check systemd state on thumbor2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:13] 
PROBLEM - Check systemd state on sessionstore1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:21] PROBLEM - Check systemd state on thumbor1003 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:23] PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:23] PROBLEM - Check systemd state on elastic1044 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:25] PROBLEM - Check systemd state on elastic1037 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:37] PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:59] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:05] PROBLEM - Check systemd state on ldap-replica1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:07] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:19] PROBLEM - Check systemd state on elastic2055 is 
CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:25] PROBLEM - Check systemd state on elastic1060 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:35] (PS1) Hashar: Split canary jobrunner to their own role [puppet] - https://gerrit.wikimedia.org/r/724694 (https://phabricator.wikimedia.org/T291870) [08:19:37] (PS1) Hashar: Set a CANARY env variable for mediawiki canaries [puppet] - https://gerrit.wikimedia.org/r/724695 (https://phabricator.wikimedia.org/T291870) [08:20:05] PROBLEM - Check systemd state on elastic2059 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:13] PROBLEM - Check systemd state on rdb1012 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:35] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:35] PROBLEM - Check systemd state on aqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:47] PROBLEM - Check systemd state on debmonitor2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:50] (CR) jerkins-bot: [V: -1] Set a CANARY env variable for mediawiki canaries [puppet] - https://gerrit.wikimedia.org/r/724695 (https://phabricator.wikimedia.org/T291870) (owner: Hashar) [08:21:15] 
PROBLEM - Check systemd state on restbase2011 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:15] PROBLEM - Check systemd state on restbase2022 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:21] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:21] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:23] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:29] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog-tls-remedy.service,rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:29] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:33] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:33] (CR) jerkins-bot: [V: -1] Split canary jobrunner to their own role [puppet] - https://gerrit.wikimedia.org/r/724694 (https://phabricator.wikimedia.org/T291870) (owner: Hashar) [08:21:41] PROBLEM - Check systemd state on ores1005 is CRITICAL: 
CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:45] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:11] RECOVERY - Check systemd state on elastic2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:29] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [08:22:52] !log fleet-wide rm /etc/rsyslog.d/00-abort-unclean-config.conf && systemctl restart rsyslog [08:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:35] RECOVERY - Check systemd state on thumbor1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:37] RECOVERY - Check systemd state on thanos-fe2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:39] RECOVERY - Check systemd state on elastic1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:39] RECOVERY - Check systemd state on elastic1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:40] RECOVERY - Check systemd state on elastic1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:43] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1000 
days) https://wikitech.wikimedia.org/wiki/Logs [08:24:43] RECOVERY - Check systemd state on people2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:45] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:45] RECOVERY - Check systemd state on aqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:47] RECOVERY - Check systemd state on restbase2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:51] RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:51] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:51] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:57] RECOVERY - Check systemd state on debmonitor2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:57] RECOVERY - Check systemd state on sessionstore2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:57] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:57] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[08:25:01] RECOVERY - Check systemd state on people1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:01] RECOVERY - Check systemd state on elastic1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:03] RECOVERY - Check systemd state on restbase1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:05] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:05] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:07] RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:13] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:13] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:17] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:17] RECOVERY - Check systemd state on rdb2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:19] RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:19] RECOVERY - Check systemd state on 
ldap-replica1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:21] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:23] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:23] RECOVERY - Check systemd state on restbase2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:25] RECOVERY - Check systemd state on restbase2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:27] RECOVERY - Check systemd state on elastic1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:29] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:29] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:31] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:31] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:31] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:31] RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:35] RECOVERY - Check systemd state on elastic2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:35] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:35] RECOVERY - Check systemd state on aqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:35] RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:37] RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:39] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:39] RECOVERY - Check systemd state on elastic1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:39] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:39] RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:40] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:41] RECOVERY - Check systemd state on aqs1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:43] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:47] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:47] (CR) Filippo Giunchedi: alerts: add multiple tags match (1 comment) [puppet] - https://gerrit.wikimedia.org/r/724354 (https://phabricator.wikimedia.org/T289662) (owner: Filippo Giunchedi) [08:25:49] RECOVERY - Check systemd state on restbase1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:49] RECOVERY - Check systemd state on restbase1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:49] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:51] RECOVERY - Check systemd state on restbase2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:51] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:51] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:53] RECOVERY - Check systemd state on ldap-replica1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:53] RECOVERY - Check systemd state on restbase2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:55] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:55] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:56] RECOVERY - Check systemd state on elastic1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:56] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:57] RECOVERY - Check systemd state on restbase2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:59] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:03] RECOVERY - Check systemd state on elastic1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:03] RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:03] RECOVERY - Check systemd state on elastic1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:07] (PS1) Hashar: logging: set canary field when host is a canary [mediawiki-config] - https://gerrit.wikimedia.org/r/724696 
(https://phabricator.wikimedia.org/T291870) [08:26:07] RECOVERY - Check systemd state on aqs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:13] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:13] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:13] RECOVERY - Check systemd state on elastic2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:14] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:17] RECOVERY - Check systemd state on ldap-replica2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:19] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:21] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:21] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:21] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:23] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:23] RECOVERY - 
Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:23] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:29] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:29] RECOVERY - Check systemd state on thumbor2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:29] RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:29] RECOVERY - Check systemd state on rdb1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:35] RECOVERY - Check systemd state on sessionstore1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:35] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:39] RECOVERY - Check systemd state on restbase1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:37] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The following units failed: session-205221.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:07] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/724464 (owner: 10PipelineBot) [08:34:23] (03PS2) 10Arturo Borrero Gonzalez: manila: use 
manila-srv service user rather than novaadmin for auth [puppet] - 10https://gerrit.wikimedia.org/r/724500 (https://phabricator.wikimedia.org/T291257) (owner: 10Andrew Bogott) [08:35:37] (03PS2) 10Hashar: Set a CANARY env variable for mediawiki canaries [puppet] - 10https://gerrit.wikimedia.org/r/724695 (https://phabricator.wikimedia.org/T291870) [08:36:25] (03Merged) 10jenkins-bot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/724464 (owner: 10PipelineBot) [08:39:40] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [08:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:32] (03CR) 10Hashar: Split canary jobrunner to their own role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724694 (https://phabricator.wikimedia.org/T291870) (owner: 10Hashar) [08:44:51] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10WMDE-Fisch) [08:49:23] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: session-205404.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:27] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003989 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:51:45] PROBLEM - SSH on ms-fe2006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:55:41] 10SRE, 10SRE Observability (FY2021/2022-Q1): rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10ema) >>! 
In T290870#7386995, @gerritbot wrote: > Change 720921 **merged** by Ema: > %%%[operations/puppet@production] rsyslog: abort on unclean config%%% > https://ger... [08:58:04] !log dcausse@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [08:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:13] RECOVERY - Check systemd state on ms-be2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:49] (03PS3) 10Elukey: helmfile.d: add user deploy-kserve [deployment-charts] - 10https://gerrit.wikimedia.org/r/724448 (https://phabricator.wikimedia.org/T286791) [09:06:07] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:21] (03PS3) 10Arturo Borrero Gonzalez: manila: use manila-srv service user rather than novaadmin for auth [puppet] - 10https://gerrit.wikimedia.org/r/724500 (https://phabricator.wikimedia.org/T291257) (owner: 10Andrew Bogott) [09:11:15] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:04] (03PS3) 10Alexandros Kosiaris: k8s: Instruct docker to keep logs at 100M, followup [puppet] - 10https://gerrit.wikimedia.org/r/719551 (https://phabricator.wikimedia.org/T289578) [09:18:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/31356/" [puppet] - 10https://gerrit.wikimedia.org/r/724500 (https://phabricator.wikimedia.org/T291257) (owner: 10Andrew Bogott) [09:19:33] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [09:21:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] "No issues met at staging, merging. 
Note that the default max-file = 1 so no need to specify it here." [puppet] - 10https://gerrit.wikimedia.org/r/719551 (https://phabricator.wikimedia.org/T289578) (owner: 10Alexandros Kosiaris) [09:25:59] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:43] (03PS1) 10Ema: rsyslog: remove 00-abort-unclean-config.conf [puppet] - 10https://gerrit.wikimedia.org/r/724707 (https://phabricator.wikimedia.org/T290870) [09:36:39] (03CR) 10Filippo Giunchedi: [C: 03+1] mtail: add counter for kernel traps [puppet] - 10https://gerrit.wikimedia.org/r/721773 (https://phabricator.wikimedia.org/T246470) (owner: 10Elukey) [09:37:01] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: remove 00-abort-unclean-config.conf [puppet] - 10https://gerrit.wikimedia.org/r/724707 (https://phabricator.wikimedia.org/T290870) (owner: 10Ema) [09:37:03] (03CR) 10Btullis: statistics::product_analytics: create and prepare (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) (owner: 10Bearloga) [09:38:13] (03PS15) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [09:39:52] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin [09:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:03] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 11s) [09:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:13] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:41:18] (03CR) 10Elukey: [C: 03+2] mtail: add counter for kernel traps [puppet] - 10https://gerrit.wikimedia.org/r/721773 (https://phabricator.wikimedia.org/T246470) (owner: 10Elukey) [09:45:56] (03CR) 10ZPapierski: Added spicerack.kafka with offset transfer function (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [09:47:03] !log dcausse@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [09:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10cmooney) Thanks for the detail @aborrero Looking at the setup the logical thing is to allocate the public IPs for these hosts fro... 
[09:52:43] RECOVERY - SSH on ms-fe2006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:52:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/724691 (owner: 10Volans) [09:53:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/724687 (owner: 10Volans) [09:53:59] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: fix Phabricator messages [cookbooks] - 10https://gerrit.wikimedia.org/r/724687 (owner: 10Volans) [09:54:07] (03CR) 10Volans: [C: 03+2] sre.hosts.downtime: poll Icinga status [cookbooks] - 10https://gerrit.wikimedia.org/r/724691 (owner: 10Volans) [09:54:21] !log bounce mtail on centrallog* - T246470 [09:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:27] T246470: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 [09:56:35] (03Merged) 10jenkins-bot: sre.experimental.reimage: fix Phabricator messages [cookbooks] - 10https://gerrit.wikimedia.org/r/724687 (owner: 10Volans) [09:56:40] (03Merged) 10jenkins-bot: sre.hosts.downtime: poll Icinga status [cookbooks] - 10https://gerrit.wikimedia.org/r/724691 (owner: 10Volans) [09:59:21] (03CR) 10Jelto: helmfile.d: add user deploy-kserve (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/724448 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:00:09] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin1001.eqiad.wmnet with reason: testing latest change [10:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:11] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin1001.eqiad.wmnet with reason: testing latest change [10:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:51] (03CR) 10Elukey: "Thanks for the 
review! Replied to the comments" [deployment-charts] - 10https://gerrit.wikimedia.org/r/724448 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:16:56] (03CR) 10Jelto: [C: 03+1] "okay thanks for the explanation. lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/724448 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:23:21] (03PS1) 10Kosta Harlan: Revert "Search header should be vertically centered, not top aligned." [skins/MinervaNeue] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724553 (https://phabricator.wikimedia.org/T292030) [10:23:53] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [10:24:03] !log volans@cumin2002 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [10:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] helmfile.d: add user deploy-kserve [deployment-charts] - 10https://gerrit.wikimedia.org/r/724448 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:27:26] (03PS1) 10Muehlenhoff: os-reports: Show remaining days until a distro is EOLed [puppet] - 10https://gerrit.wikimedia.org/r/724715 [10:33:32] (03PS2) 10Muehlenhoff: os-reports: Show remaining days until a distro is EOLed [puppet] - 10https://gerrit.wikimedia.org/r/724715 [10:34:50] !log volans@cumin2002 END (ERROR) - Cookbook sre.experimental.reimage (exit_code=97) for host sretest1001.eqiad.wmnet [10:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:27] !log volans@cumin2002 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [10:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:06] (03CR) 10jerkins-bot: [V: 04-1] Revert "Search header should be vertically centered, not top aligned." 
[skins/MinervaNeue] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724553 (https://phabricator.wikimedia.org/T292030) (owner: 10Kosta Harlan) [10:41:09] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:37] (03CR) 10Kosta Harlan: "recheck" [skins/MinervaNeue] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724553 (https://phabricator.wikimedia.org/T292030) (owner: 10Kosta Harlan) [10:44:01] (03PS1) 10Hnowlan: secrets: Clean up restbase stub certificates [labs/private] - 10https://gerrit.wikimedia.org/r/724717 [10:48:55] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv [10:48:55] e - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IP [10:48:55] ve - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad 
https://wikitech.wikimedia.org/wiki/Network_monitor [10:48:55] P_status [10:48:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [10:49:43] PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [10:49:49] that's me and it might be bigger than expected^^ [10:49:53] are there ongoing issues? lots of socket timeouts [10:50:07] yeah yeah kubernetes calico issues [10:50:10] I may have jumped the gun [10:50:17] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:50:23] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:50:27] (KubernetesCalicoDown) firing: (17) kubernetes1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:50:29] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:50:29] PROBLEM - 
restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:37] PROBLEM - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is CRITICAL: /{format}/ (mass-energy equivalence (mml)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mathoid [10:50:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_eqiad,swagger_check_echostore_eqiad,swagger_check_eventgate_analytics_external_cluster_eqiad,swagger_check_eventstreams_internal_cluster_eqiad,swagger_check_mathoid_cluster_eqiad,swagger_check_mobileapps_cluster_eqiad,swagger_check_termbox_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prom [10:50:39] 3Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:50:51] ok, I think it should start fixing itself now, but it will take a while before icinga sees it [10:50:57] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POS [10:50:57] PROBLEM - Apache HTTP on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:50:58] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 1 gt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:51:01] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:13] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=api_appserver&var-datasource=eqiad%20prometheus%2Fops&var-method=GET&viewPanel=9 seems to be going down [10:51:38] PROBLEM - Apache HTTP on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:51:38] PROBLEM - PHP7 rendering on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:51:38] PROBLEM - Apache HTTP on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:51:38] PROBLEM - PHP7 rendering on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:51:49] RECOVERY - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mathoid [10:51:52] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:51:52] yeah this is my mistake, I destroyed kubernetes networking [10:51:57] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [10:52:02] RECOVERY - 
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:52:09] RECOVERY - Apache HTTP on mw1383 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:52:16] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:52:19] RECOVERY - PHP7 rendering on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.573 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:52:19] RECOVERY - Apache HTTP on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:52:19] RECOVERY - Apache HTTP on mw1378 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.369 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:52:22] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 81, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:52:29] RECOVERY - PHP7 rendering on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.290 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:52:43] <_joe_> uh wtf? [10:52:54] <_joe_> why is mediawiki having issues too? 
[10:53:02] I'm here too [10:53:12] _joe_: sessionstore probably [10:53:12] PROBLEM - MediaWiki edit session loss on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/edit-count?panelId=13&fullscreen&orgId=1 [10:53:18] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [10:53:25] <_joe_> ok sessionstore [10:53:33] <_joe_> should we fail it to codfw akosiaris ? [10:53:50] no, I think we are ok now (we have been for some time actually) [10:53:56] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) puppetdb was fairly stable for some time, however we had to add some of the facts back into puppetdb specifically the numa and partitions... [10:53:58] Wikidata edits seem to be recovering already [10:53:59] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [10:54:06] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:54:11] <_joe_> yeah [10:54:18] ack, thanks [10:54:24] <_joe_> clearly losing networking to sessionstore does create a lot of issues [10:54:24] latency seems good, but traffic seems like 15% lower at the moment [10:54:37] <_joe_> I think we're not out of the woods [10:55:15] error rates are 0 again [10:55:26] 5xx that is [10:55:27] (KubernetesCalicoDown) resolved: (17) kubernetes1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:56:18] going to appreciate that ^ is 17 alerts but one irc notification [10:56:18] <_joe_> yeah long fetch times are over now in the mediawiki logs too [10:56:36] I wonder if the traffic thing is an artifact of monitoring (average over some time) [10:56:44] going back to lunch [10:57:29] I'll draft a short incident report to let everyone know how I almost broke wikipedia and how it got fixed on its own. [10:57:31] or tail of reporting long running requests [10:57:44] <_joe_> * akosiaris is now known as chaos_monkey [10:57:48] but traffic seems now back to normal levels [10:57:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [10:58:03] looking forward to the report on the thing that fixed itself, sounds cool :) [10:58:18] _joe_: I think I've had that emmm *honor?* for some time now. 
[10:58:32] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04839 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:58:36] <_joe_> we all do, from time to time [10:58:53] <_joe_> we need more aggressive circuit breaking with sessionstore I fear [10:59:16] it's on 0.2s right now IIRC [10:59:32] but for what it's worth, I think all of sessionstore was unreachable [10:59:47] <_joe_> no I mean the circuit breaking, returning 503 fast to mediawiki [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European mid-day backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1100). [11:00:05] MatmaRex, WMDE-Fisch, Lucas_WMDE, and kostajh: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] let’s hold for a bit [11:00:25] \o [11:00:36] hi [11:00:44] <_joe_> akosiaris: interestingly, sessionstore responded 500 to mediawiki for some time, I guess it couldn't talk to cassandra [11:00:51] akosiaris, _joe_, godog: can you let us know when we’re good to proceed with the backport window? [11:01:07] \o [11:01:11] Lucas_WMDE: yeah, give it another 5-6 minutes to make sure, but it looks ok already. 
[11:01:14] (I’m assuming you want some break time after this first) [11:01:14] <_joe_> Lucas_WMDE: I'd mostly wait for logstash to catch up to its lag [11:01:16] ok [11:01:23] * Lucas_WMDE looks at the calendar [11:01:30] * WMDE-Fisch will not be able to deploy myself [11:01:53] https://grafana.wikimedia.org/d/000001590/sessionstore?viewPanel=46&orgId=1&from=1632912431713&to=1632912843141&var-dc=thanos&var-site=eqiad&var-service=sessionstore&var-prometheus=k8s&var-container_name=kask-production [11:02:01] yeah, all of sessionstore was unavailable for some time [11:02:05] MatmaRex: do you know how long DiscussionTools CI usually takes? wondering if we should already +2 and then wait for it [11:02:15] but that's not specific to session store, everything in kubernetes was unavailable for some time [11:02:30] <_joe_> akosiaris: in eqiad, one thing we could implement at some point is to have envoy failover to the other dc automatically [11:02:34] <_joe_> but anyways [11:02:43] <_joe_> all good for now, impressive it self-healed :D [11:02:47] Lucas_WMDE: a few minutes i guess? i'm not really sure. 
mostly the same as any extension [11:02:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [11:02:58] I’m used to 20-30 minutes from Wikibase :P [11:03:00] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:03:02] let’s wait a bit more then [11:03:09] ^ especially if recoveries are still trickling in [11:03:17] Lucas_WMDE: oh, more like 10 minutes [11:03:22] <_joe_> Lucas_WMDE: you can start merging patches if you want to [11:03:30] ok, then let’s +2 these [11:03:32] but not the config change yet [11:03:37] (since config CI is faster) [11:03:39] thanks _joe_ [11:03:53] <_joe_> akosiaris: i am not sure mobileapps recovered [11:03:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [11:04:09] <_joe_> or if it's just icinga being slow to recover [11:04:14] RECOVERY - MediaWiki edit session loss on graphite1004 is OK: OK: Less than 30.00% above the threshold [10.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/edit-count?panelId=13&fullscreen&orgId=1 [11:04:40] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix almost all errors codes being logged as `http-0` [extensions/DiscussionTools] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724378 (https://phabricator.wikimedia.org/T290514) (owner: 10Bartosz Dziewoński) [11:04:44] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) It's also worth 
noting that the increase in processing times no longer aligns with [[ https://grafana.wikimedia.org/d/C0lCOf3Mz/puppetdb-p... [11:04:44] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [11:04:48] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix almost all errors codes being logged as `http-0` [extensions/DiscussionTools] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724379 (https://phabricator.wikimedia.org/T290514) (owner: 10Bartosz Dziewoński) [11:05:01] <_joe_> yeah a lot of it is icinga being slow [11:05:03] _joe_: icinga is certainly slow [11:05:42] Lucas_WMDE: thanks for deploying! I'm available to test the line numbering patch on mwdebug, when you get to it. [11:05:44] but cxserver indeed seems to be in some weird woods right now [11:05:48] awight: ack [11:05:58] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [11:06:05] Lucas_WMDE: https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/724553 will take a while in CI if you want to +2 that now [11:06:05] <_joe_> akosiaris: yeah I think a few things could use a rolling restart of pods [11:06:06] <_joe_> :/ [11:06:15] <_joe_> losing networking can cause such situations [11:06:19] this graph is interesting, too: https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&from=1632910894671&to=1632913524604 [11:06:22] it's actually what brought down everything (the rolling restart) [11:06:36] <_joe_> of what, calico-node? [11:06:40] it just wasn't that good of a rolling restart [11:06:41] yes [11:07:19] jynus: it's the same pattern, isn't it? 
[11:07:32] inability to do almost anything for a few mins [11:07:43] yeah, but I am trying to think why it didn't recover as quickly [11:08:03] for us lurkers, can you explain what went wrong with the rolling restart? [11:08:05] as it does when it is not a long issue [11:08:25] so I am thinking maybe the session failed for ongoing edits too? [11:09:09] <_joe_> akosiaris: yeah somehow icinga is not picking up things recovering [11:09:13] apergos: calico-nodes were restarted in a rolling fashion, but before they could correctly peer with the routers, a component they rely on, calico-typha, was restarted too. [11:09:15] <_joe_> and it says it checked! [11:09:40] oic [11:09:48] ok makes sense [11:10:12] (03Merged) 10jenkins-bot: Fix almost all errors codes being logged as `http-0` [extensions/DiscussionTools] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724378 (https://phabricator.wikimedia.org/T290514) (owner: 10Bartosz Dziewoński) [11:10:13] _joe_: average service check latency is like 35secs from what I see [11:10:31] <_joe_> akosiaris: well lvs1016's pybal has been healthy for 10 minutes now [11:10:31] (brb) [11:10:41] that's a lot then [11:10:43] <_joe_> not to talk about all the appservers [11:10:44] mh, I was about to ask MatmaRex if that change is testable [11:11:15] (03Merged) 10jenkins-bot: Fix almost all errors codes being logged as `http-0` [extensions/DiscussionTools] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724379 (https://phabricator.wikimedia.org/T290514) (owner: 10Bartosz Dziewoński) [11:11:25] <_joe_> same with the rb nodes [11:11:31] <_joe_> I'd say we're out of the woods [11:11:39] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:11:45] <_joe_> Lucas_WMDE: you can go on IMHO, dunno if akosiaris agrees [11:11:49] ok [11:11:59] yeah, I think you are good to go [11:12:03] ok thanks [11:12:15] I’m guessing the change won’t be testable because it’s in error handling from MediaWiki [11:13:23] _joe_: the only error left is the recent changes one [11:13:43] CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. [11:13:52] pulled both backports to mwdebug1001, I’ll briefly check that DiscussionTools isn’t totally broken [11:14:05] akosiaris, I think that is a real issue, just not user impacting [11:14:18] when there are lot of mw errors, logging gets behind [11:14:34] not ideal, but "expected" [11:14:49] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) >>! In T290536#7383383, @akosiaris wrote: >>>! In T290536#7383272, @jijiki wrote: > > > That's currently my preferred way cause it's determinis... [11:15:16] jynus: not sure I follow. I am not familiar with that alert on the other hand. 
[11:15:16] oh, that is recentchanges, so ignore my last comments [11:15:24] ok [11:15:25] I thought it was the logging stream [11:15:55] ok, DiscussionTools still working afaict, let’s sync (wmf.2 first because that’s only on group0 so far) [11:15:57] !log volans@cumin2002 END (ERROR) - Cookbook sre.experimental.reimage (exit_code=97) for host sretest1001.eqiad.wmnet [11:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:27] eventstreams in eqiad seems to be pretty fine [11:16:34] (back) [11:16:40] (03CR) 10Muehlenhoff: [C: 03+2] os-reports: Show remaining days until a distro is EOLed [puppet] - 10https://gerrit.wikimedia.org/r/724715 (owner: 10Muehlenhoff) [11:16:48] !log volans@cumin2002 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [11:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:54] MatmaRex: I’m deploying the backports [11:17:07] I assumed the specific change wasn’t testable and just made sure DiscussionTools was still generally working [11:17:08] Lucas_WMDE: thanks, sorry, i was away for a minute [11:17:10] akosiaris, yeah I can see it sending the last recent events [11:17:21] maybe the checker got, somehow, stuck or something? [11:17:38] I’m syncing to wmf.2 first, if you want to test it on test wikis before I sync wmf.1 let me know [11:17:44] yeah double checking it [11:17:47] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/DiscussionTools/modules/dt.ui.ReplyWidget.js: Backport: [[gerrit:724379|Fix almost all errors codes being logged as `http-0` (T290514)]] (duration: 01m 09s) [11:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:52] T290514: Reply tool has gotten slower? - https://phabricator.wikimedia.org/T290514 [11:18:46] RECOVERY - Check if active EventStreams endpoint is delivering messages. 
on alert1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [11:18:53] ah there we go [11:18:56] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) > however there is an earlier peak at ~10:15 on the 21st worth exploring This alligned to the following puppetdb log event ` 2021-09-21T... [11:19:15] ok, crisis fully over. I am going for lunch [11:19:16] I think it was a force I did [11:19:22] thanks, akosiaris ! [11:19:25] in icinga? force check? [11:19:25] \o/ [11:19:32] yeah [11:19:38] syncing the backport to wmf.1 now [11:19:59] I’ll +2 the other two backports so they can start going through gate-and-submit [11:20:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724371 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:20:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "Search header should be vertically centered, not top aligned." 
[skins/MinervaNeue] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724553 (https://phabricator.wikimedia.org/T292030) (owner: 10Kosta Harlan) [11:20:43] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.1/extensions/DiscussionTools/modules/dt.ui.ReplyWidget.js: Backport: [[gerrit:724378|Fix almost all errors codes being logged as `http-0` (T290514)]] (duration: 01m 09s) [11:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable line numbering on all namespaces (pilot wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722279 (https://phabricator.wikimedia.org/T280027) (owner: 10Awight) [11:22:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:41] (03Merged) 10jenkins-bot: Enable line numbering on all namespaces (pilot wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722279 (https://phabricator.wikimedia.org/T280027) (owner: 10Awight) [11:23:18] awight: the change should be on mwdebug1002, please test [11:23:28] Lucas_WMDE: ack [11:24:28] Lucas_WMDE: It works :-) [11:24:44] ok :) [11:25:38] (03CR) 10Hnowlan: [V: 03+1] "pcc testing of restbase2023 (which is our test subject for the FQDN certificates) is blocked on https://gerrit.wikimedia.org/r/c/labs/priv" [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) (owner: 10Hnowlan) [11:25:56] (03CR) 10Ema: [C: 03+2] rsyslog: remove 00-abort-unclean-config.conf [puppet] - 10https://gerrit.wikimedia.org/r/724707 (https://phabricator.wikimedia.org/T290870) (owner: 10Ema) [11:26:15] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:722279|Enable line numbering on all namespaces (pilot wikis) (T280027)]] (duration: 01m 09s) [11:26:20] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:21] T280027: Enable line numbering on all namespaces on first wikis - https://phabricator.wikimedia.org/T280027 [11:26:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:44] alright, now we wait for Wikibase and MinervaNeue… [11:31:42] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:05] a handful of shellbox 503 are showing up in logstash again [11:34:14] two from syntaxhighlight, one from score [11:34:22] but probably not enough to worry about yet [11:36:14] (03PS1) 10Arturo Borrero Gonzalez: openstack: manila: use manilainfra project to host service instances [puppet] - 10https://gerrit.wikimedia.org/r/724725 (https://phabricator.wikimedia.org/T291257) [11:38:35] 10SRE, 10Maps, 10Platform Team Workboards (Platform Engineering Reliability): Maps postgres read replicas throws errors on eqiad - https://phabricator.wikimedia.org/T289852 (10hnowlan) 05Open→03Resolved [11:38:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:01] (03Merged) 10jenkins-bot: Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724371 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:41:04] (03Merged) 10jenkins-bot: Revert "Search header should be vertically centered, not top aligned." 
[skins/MinervaNeue] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724553 (https://phabricator.wikimedia.org/T292030) (owner: 10Kosta Harlan) [11:41:28] let’s do wikibase first, that one is basically not testable so I’ll just roll it out [11:41:33] (it’s also been on wmf.1 since yesterday) [11:42:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:02] !log volans@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1001.eqiad.wmnet [11:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:27] kostajh: I’ll be ready to deploy your change in 1-2 minutes, fyi [11:43:35] sounds good [11:43:58] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/Wikibase/repo/includes/Store/Sql/SqlSiteLinkConflictLookup.php: Backport: [[gerrit:724371|Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup (T291377)]] (duration: 01m 07s) [11:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:03] T291377: Prevent creation of items having the same sitelinks (duplicates) using memcached and database locks - https://phabricator.wikimedia.org/T291377 [11:44:33] ok [11:45:05] kostajh: the change should be on mwdebug1002, can you test it? [11:45:16] Lucas_WMDE: yep looking [11:45:28] ok [11:46:06] Lucas_WMDE: lgtm [11:46:12] ok, syncing [11:48:00] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.2/skins/MinervaNeue/skinStyles/mobile.startup/Overlay.less: Backport: [[gerrit:724553|Revert "Search header should be vertically centered, not top aligned." 
(T292030)]] (duration: 01m 07s) [11:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:06] T292030: [regression-wmf.2] VE icons displayed incorrectly on mobile - https://phabricator.wikimedia.org/T292030 [11:48:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: manila: use manilainfra project to host service instances [puppet] - 10https://gerrit.wikimedia.org/r/724725 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [11:48:26] alright, deployment window done I think [11:48:40] (03CR) 10Ssingh: haproxy: Allow loading lua scripts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/720273 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:48:42] thanks Lucas_WMDE! [11:48:45] I might backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaBadges/+/724710 later (fixes an uncommon prod error), but it hasn’t been reviewed yet [11:48:49] !log EU backport+config window done [11:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:18] (03PS1) 10Arturo Borrero Gonzalez: openstack: manila: refresh services on config file change [puppet] - 10https://gerrit.wikimedia.org/r/724726 (https://phabricator.wikimedia.org/T291257) [11:55:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: manila: refresh services on config file change [puppet] - 10https://gerrit.wikimedia.org/r/724726 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [11:55:25] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31357/console" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:56:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:28] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:03:01] (03CR) 10Ssingh: [C: 03+1] haproxy: Add PROXY protocol support [puppet] - 10https://gerrit.wikimedia.org/r/720021 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:06:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:36] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:09:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:05] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:20] (03CR) 10DCausse: [C: 03+1] alerts: add multiple tags match (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724354 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [12:23:12] (03CR) 10Ssingh: haproxy: Allow adding/removing HTTP headers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/720272 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:23:55] (03PS1) 10Jbond: P:tlsproxy::instance: move defaults to hiera [puppet] - 10https://gerrit.wikimedia.org/r/724730 (https://phabricator.wikimedia.org/T263578) [12:26:06] (03CR) 10Ssingh: haproxy: STEK support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:26:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31358/console" [puppet] - 10https://gerrit.wikimedia.org/r/724730 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [12:27:15] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! In T291918#7386927, @Joe wrote: > The first scenario I proposed in T290536 goes as follows: > * One cluster for first deploy/debug purposes (kube-mwdebug) >... 
[12:34:28] (03PS1) 10Bartosz Dziewoński: Make reply tool available as opt-out almost everywhere (phase 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724732 (https://phabricator.wikimedia.org/T288485) [12:36:30] (03PS1) 10Jbond: P:tlsproxy::instance: Drop numa_networking global [puppet] - 10https://gerrit.wikimedia.org/r/724733 (https://phabricator.wikimedia.org/T263578) [12:38:08] (03CR) 10jerkins-bot: [V: 04-1] P:tlsproxy::instance: Drop numa_networking global [puppet] - 10https://gerrit.wikimedia.org/r/724733 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [12:39:34] (03PS6) 10Jbond: interface: update rps script to also set the number of queues via ethtool [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) [12:40:27] (03CR) 10Jbond: "Could i get further review of this change. this would allow use to drop the numa fact which would help with puppetdb issues we are having" [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [12:40:38] (03PS4) 10Jbond: interfaces: remove ethtool configueration [puppet] - 10https://gerrit.wikimedia.org/r/662699 (https://phabricator.wikimedia.org/T236208) [12:42:37] (03PS5) 10Jbond: interfaces: remove ethtool configueration [puppet] - 10https://gerrit.wikimedia.org/r/662699 (https://phabricator.wikimedia.org/T236208) [12:43:25] (03Abandoned) 10Jbond: numa_networking: drop numa_networking global variable [puppet] - 10https://gerrit.wikimedia.org/r/662700 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [12:45:13] (03PS4) 10Jbond: (WIP) interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) [12:45:24] (03PS4) 10Jbond: systemd: Add support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) [12:46:12] (03CR) 10jerkins-bot: [V: 04-1] systemd: Add 
support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [12:46:22] (03CR) 10jerkins-bot: [V: 04-1] (WIP) interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [12:48:06] (03PS2) 10Jbond: P:tlsproxy::instance: Drop numa_networking global [puppet] - 10https://gerrit.wikimedia.org/r/724733 (https://phabricator.wikimedia.org/T263578) [12:49:09] (03PS16) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [12:51:37] (03PS2) 10Filippo Giunchedi: prometheus: add instance-specific alerts path [puppet] - 10https://gerrit.wikimedia.org/r/724353 (https://phabricator.wikimedia.org/T289662) [12:52:02] (03PS1) 10Volans: sre.experimental.reimage: fix installer check [cookbooks] - 10https://gerrit.wikimedia.org/r/724735 [12:55:04] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:tlsproxy::instance: move defaults to hiera [puppet] - 10https://gerrit.wikimedia.org/r/724730 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [12:56:51] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:00:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/724735 (owner: 10Volans) [13:00:27] (03CR) 10Elukey: [C: 03+2] helmfile.d: add user deploy-kserve [deployment-charts] - 10https://gerrit.wikimedia.org/r/724448 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [13:00:42] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: fix installer check [cookbooks] - 10https://gerrit.wikimedia.org/r/724735 (owner: 10Volans) [13:03:28] (03Merged) 10jenkins-bot: sre.experimental.reimage: 
fix installer check [cookbooks] - 10https://gerrit.wikimedia.org/r/724735 (owner: 10Volans) [13:04:22] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:53] https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-29_eqiad-kubernetes [13:06:58] fyi ^ [13:07:51] thanks! [13:08:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:08:04] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:37] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [13:09:37] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [13:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:46] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add instance-specific alerts path [puppet] - 10https://gerrit.wikimedia.org/r/724353 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [13:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:48] !log volans@cumin2002 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [13:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:06] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . 
[13:11:06] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [13:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:45] (03PS1) 10Jbond: depdeploy:client: always create the debdeploy-client directory [puppet] - 10https://gerrit.wikimedia.org/r/724737 [13:12:02] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) >>! In T291918#7387656, @jijiki wrote: > Naming things is hard though, I do not agree with the `kube` prefix, in the future after baremetal mediawiki servers are go... [13:12:20] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: add multiple tags match (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724354 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [13:12:38] (03PS1) 10DCausse: rdf-streaming-updater: Reduce flink mem [deployment-charts] - 10https://gerrit.wikimedia.org/r/724738 [13:14:01] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Joe) I forgot to add: we probably also want to migrate wikitech early in the process. It will need us to add php-ldap to our debug image, but it should allow us to dogfo... 
[13:14:07] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: deploy instance-specific alerts [puppet] - 10https://gerrit.wikimedia.org/r/724355 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [13:15:48] (03PS1) 10Volans: sre.experimental.reimage: less verbose output [cookbooks] - 10https://gerrit.wikimedia.org/r/724739 [13:21:41] (03CR) 10Jbond: [C: 03+2] depdeploy:client: always create the debdeploy-client directory [puppet] - 10https://gerrit.wikimedia.org/r/724737 (owner: 10Jbond) [13:23:10] (03CR) 10ZPapierski: [C: 03+1] rdf-streaming-updater: Reduce flink mem [deployment-charts] - 10https://gerrit.wikimedia.org/r/724738 (owner: 10DCausse) [13:23:22] (03PS3) 10Jelto: profile::gitlab start using gitlab module [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) [13:25:14] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: Reduce flink mem [deployment-charts] - 10https://gerrit.wikimedia.org/r/724738 (owner: 10DCausse) [13:26:27] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31361/console" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [13:28:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/724739 (owner: 10Volans) [13:28:01] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for gehel - https://phabricator.wikimedia.org/T292040 (10Ottomata) Approved [13:28:53] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: less verbose output [cookbooks] - 10https://gerrit.wikimedia.org/r/724739 (owner: 10Volans) [13:29:37] (03Merged) 10jenkins-bot: rdf-streaming-updater: Reduce flink mem [deployment-charts] - 10https://gerrit.wikimedia.org/r/724738 (owner: 10DCausse) [13:31:44] (03Merged) 10jenkins-bot: sre.experimental.reimage: less verbose output [cookbooks] - 
10https://gerrit.wikimedia.org/r/724739 (owner: 10Volans) [13:31:44] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [13:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:41] !log volans@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1001.eqiad.wmnet [13:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:39] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:43] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:26] (03PS4) 10Jelto: profile::gitlab start using gitlab module [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) [13:44:29] (03PS1) 10Elukey: helmfile.d: fix ml-services private config settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/724745 [13:45:01] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31362/console" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [13:45:12] (03CR) 10Brennen Bearnes: gitlab / idp: open gitlab access to all users [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [13:45:19] (03PS2) 10Brennen Bearnes: gitlab / idp: open gitlab access to all users [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) [13:49:03] (03CR) 10Elukey: [C: 03+2] helmfile.d: fix ml-services private config settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/724745 (owner: 10Elukey) [13:51:15] (03CR) 10Brennen 
Bearnes: "T288392 is resolved, think we're ready to go ahead with this." [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [13:53:39] (03PS1) 10Elukey: secrets: add quotes to label values [deployment-charts] - 10https://gerrit.wikimedia.org/r/724746 [13:53:49] (03CR) 10Jelto: [V: 03+1] "With this change the gitlab profile starts using gitlab modules. I had to do some refactoring because some resources were duplicate (copy " [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [13:58:44] (03CR) 10Elukey: [C: 03+2] secrets: add quotes to label values [deployment-charts] - 10https://gerrit.wikimedia.org/r/724746 (owner: 10Elukey) [13:58:59] (03CR) 10Btullis: "Looks fine to me. Great to be closing down a ticket from 2016." [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) (owner: 10Hnowlan) [14:01:25] (03CR) 10Btullis: cassandra: use FQDN in CN name for future instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) (owner: 10Hnowlan) [14:01:59] !log dcausse@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[14:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:58] 10SRE-Access-Requests: Requesting access to RESOURCE for Swakiyama - https://phabricator.wikimedia.org/T292069 (10cchen) [14:04:10] 10SRE-Access-Requests, 10Product-Analytics: Requesting access to RESOURCE for Swakiyama - https://phabricator.wikimedia.org/T292069 (10cchen) [14:04:29] !log pt1979@cumin2002 START - Cookbook sre.experimental.reimage for host thumbor2006.codfw.wmnet [14:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:35] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet [14:07:34] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:07:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:32] (03CR) 10Jelto: [C: 03+2] gitlab / idp: open gitlab access to all users [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [14:08:45] !log dcausse@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[14:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:10] (03PS1) 10DCausse: alertmanager: Add irc notification for search-platform alerts [puppet] - 10https://gerrit.wikimedia.org/r/724752 (https://phabricator.wikimedia.org/T276467) [14:17:52] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2001.codfw.wmnet [14:17:55] (03PS1) 10Muehlenhoff: Add testvm2003 [puppet] - 10https://gerrit.wikimedia.org/r/724753 (https://phabricator.wikimedia.org/T286206) [14:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:40] (03CR) 10Muehlenhoff: [C: 03+2] Add testvm2003 [puppet] - 10https://gerrit.wikimedia.org/r/724753 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [14:22:25] (03CR) 10DCausse: "flink alerts do seem sane" [puppet] - 10https://gerrit.wikimedia.org/r/724752 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [14:25:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts testvm2001.codfw.wmnet [14:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:39] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2001.codfw.wmnet` - testvm2001.codfw.wmnet (**WARN**) - //Host not found on Ici... [14:28:34] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! In T291918#7387775, @Joe wrote: >>>! In T291918#7387656, @jijiki wrote: >> Naming things is hard though, I do not agree with the `kube` prefix, in the future... [14:30:20] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10jijiki) >>! 
In T291918#7387778, @Joe wrote: > I forgot to add: we probably also want to migrate wikitech early in the process. It will need us to add php-ldap to our deb... [14:32:06] (03PS1) 10Ladsgroup: Track time until dispatched recent changes are inserted [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724558 (https://phabricator.wikimedia.org/T291962) [14:35:05] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:03] (03PS1) 10Ottomata: Remove thorium from linkrecommendation egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/724754 (https://phabricator.wikimedia.org/T285355) [14:38:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:44] 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset for Swakiyama - https://phabricator.wikimedia.org/T292069 (10Aklapper) [14:44:56] (03CR) 10Elukey: [V: 03+1 C: 03+2] statistics: remove leftovers for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/722270 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [14:45:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [14:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:37] (03PS1) 10Ottomata: Set thorium to role spare::system and remove references to thorium [puppet] - 10https://gerrit.wikimedia.org/r/724756 (https://phabricator.wikimedia.org/T292075) [14:47:48] (03PS2) 10Ottomata: Set thorium to role spare::system and remove references to thorium [puppet] - 10https://gerrit.wikimedia.org/r/724756 (https://phabricator.wikimedia.org/T292075) [14:47:55] (03CR) 10Ottomata: [C: 03+2] Remove thorium from linkrecommendation egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/724754 
(https://phabricator.wikimedia.org/T285355) (owner: 10Ottomata) [14:51:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [14:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:36] (03PS1) 10Elukey: helmfile.d: skip helm3 namespace creation for ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/724757 (https://phabricator.wikimedia.org/T286791) [15:00:33] (03CR) 10Elukey: [C: 03+2] helmfile.d: skip helm3 namespace creation for ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/724757 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [15:00:46] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: Add irc notification for search-platform alerts [puppet] - 10https://gerrit.wikimedia.org/r/724752 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [15:02:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[15:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:57] RECOVERY - HP RAID on ms-be2036 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:04:34] jouncebot: nowandnext [15:04:35] No deployments scheduled for the next 2 hour(s) and 55 minute(s) [15:04:35] In 2 hour(s) and 55 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1800) [15:04:35] In 2 hour(s) and 55 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1800) [15:04:39] oh nice [15:05:08] (03CR) 10Ladsgroup: [C: 03+2] "deploying" [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724558 (https://phabricator.wikimedia.org/T291962) (owner: 10Ladsgroup) [15:05:27] 10SRE, 10SRE-swift-storage, 10ops-codfw: swift - ms-be2036 - device sdg:4 unavailable - https://phabricator.wikimedia.org/T291988 (10Papaul) 05Open→03Resolved a:03Papaul @fgiunchedi disk replaced [15:06:10] (03CR) 10jerkins-bot: [V: 04-1] Track time until dispatched recent changes are inserted [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724558 (https://phabricator.wikimedia.org/T291962) (owner: 10Ladsgroup) [15:07:54] Amir1: I’m thinking of backporting that WikimediaBadges fix, any objections? 
[15:08:10] Lucas_WMDE: sure, go ahead [15:08:13] ok [15:08:39] (03PS1) 10Lucas Werkmeister (WMDE): Handle missing items in WikibaseClientSiteLinksForItemHandler [extensions/WikimediaBadges] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724560 (https://phabricator.wikimedia.org/T291953) [15:08:52] (03PS1) 10Lucas Werkmeister (WMDE): Handle missing items in WikibaseClientSiteLinksForItemHandler [extensions/WikimediaBadges] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724561 (https://phabricator.wikimedia.org/T291953) [15:09:18] oh wait, you’re backporting something too? [15:09:22] then I’ll wait a bit [15:10:38] papaul: thank you re: ms-be2036, appreciate it! do you have also another drive for ms-be2035 by any chance ? [15:12:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [15:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:30] 10SRE, 10SRE-swift-storage, 10ops-codfw: swift - ms-be2035 - device sdi:6 unavailable - https://phabricator.wikimedia.org/T291896 (10Papaul) 05Open→03Resolved a:03Papaul @fgiunchedi disk replaced [15:14:31] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10Papaul) All Firmware upgraded on the server [15:15:48] (03PS1) 10Jbond: icinga: add recheck_failed_services function [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 [15:16:35] RECOVERY - HP RAID on ms-be2035 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:17:51] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10Papaul) @wiki_willy yes we do have 10 decom'd servers of the same type. 
[15:19:48] (03PS1) 10Filippo Giunchedi: o11y: port alertmanager alerts [alerts] - 10https://gerrit.wikimedia.org/r/724761 (https://phabricator.wikimedia.org/T288726) [15:21:38] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Create coolest-tool-academy mailing list for Coolest Tool Award - https://phabricator.wikimedia.org/T290511 (10Aklapper) [15:21:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [15:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:10] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2035.codfw.wmnet [15:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host thumbor2006.codfw.wmnet [15:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:58] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet completed: - thumbor2006 (*... 
[15:28:10] (03PS1) 10Jgreen: add frdb2003 to modules/icinga/templates/nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/724764 (https://phabricator.wikimedia.org/T290484) [15:28:28] (03CR) 10Volans: icinga: add recheck_failed_services function (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [15:29:12] (03CR) 10Jbond: "lgtm but see comments, nothing is blocking from my side" [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [15:29:14] (03Abandoned) 10Ppchelko: DNM: Demo of the changes between eventgate and common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/722656 (owner: 10Ppchelko) [15:29:25] (03CR) 10Jgreen: [C: 03+2] add frdb2003 to modules/icinga/templates/nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/724764 (https://phabricator.wikimedia.org/T290484) (owner: 10Jgreen) [15:30:30] RECOVERY - Device not healthy -SMART- on ms-be2035 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2035&var-datasource=codfw+prometheus/ops [15:31:12] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10Jgreen) [15:33:58] (03PS1) 10Ladsgroup: Enable change dispatching via jobs in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724765 (https://phabricator.wikimedia.org/T48643) [15:34:48] RECOVERY - Check systemd state on ms-be2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:06] (03Merged) 10jenkins-bot: Track time until dispatched recent changes are inserted [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724558 (https://phabricator.wikimedia.org/T291962) (owner: 10Ladsgroup) [15:39:06] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/Wikibase/client: Backport: [[gerrit:724558|Track time until dispatched recent changes are inserted (T291962)]] (duration: 01m 10s) [15:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:12] T291962: Track time between change is inserted into Recent Changes and time of change - https://phabricator.wikimedia.org/T291962 [15:39:51] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2035.codfw.wmnet [15:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:18] Amir1: do you want to deploy that config change too or can I do the backports? [15:40:41] does your backports have +2? 
[15:40:47] not yet [15:40:47] (03CR) 10Krinkle: [C: 04-1] "I think CANARY is too generic as a variable, and I think the servergroup would be more useful in its current function rather than be split" [puppet] - 10https://gerrit.wikimedia.org/r/724695 (https://phabricator.wikimedia.org/T291870) (owner: 10Hashar) [15:41:01] I’m waiting until you’re done d) [15:41:01] so +2 them, in the mean time I deploy the config change [15:41:02] * :) [15:41:12] I don’t know how long CI on WikimediaBadges takes [15:41:15] (03CR) 10Ladsgroup: [C: 03+2] Enable change dispatching via jobs in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724765 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [15:41:19] jouncebot: nowandnext [15:41:19] No deployments scheduled for the next 2 hour(s) and 18 minute(s) [15:41:19] In 2 hour(s) and 18 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1800) [15:41:19] In 2 hour(s) and 18 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1800) [15:41:22] there’s plenty of time left [15:41:24] (03CR) 10Michael Große: [C: 03+1] "I'll admit, I'm a bit nervous, but I'm sure it'll be fine 😅" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724765 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [15:42:09] (03Merged) 10jenkins-bot: Enable change dispatching via jobs in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724765 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [15:44:01] (03PS2) 10Jbond: icinga: add recheck_failed_services function [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 [15:44:30] !log pt1979@cumin2002 START - Cookbook sre.experimental.reimage for host thumbor2006.codfw.wmnet [15:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:35] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: 
TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet [15:44:51] (03CR) 10Jbond: icinga: add recheck_failed_services function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [15:44:54] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724765|Enable change dispatching via jobs in wikidatawiki (T48643)]] (duration: 01m 08s) [15:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:00] T48643: [Story] Dispatching via job queue (instead of cron script) - https://phabricator.wikimedia.org/T48643 [15:45:01] !log disabled cron dispatching for mediawikiwiki [15:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:46] wikidatawiki or mediawikiwiki? [15:46:09] majavah: mediawikiwiki as a client of wikidatawiki [15:46:17] we are deploying partially [15:46:18] ahh [15:47:08] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:48:01] Lucas_WMDE: I'm done [15:48:06] ok thanks [15:48:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Handle missing items in WikibaseClientSiteLinksForItemHandler [extensions/WikimediaBadges] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724560 (https://phabricator.wikimedia.org/T291953) (owner: 10Lucas Werkmeister (WMDE)) [15:48:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Handle missing items in WikibaseClientSiteLinksForItemHandler [extensions/WikimediaBadges] (wmf/1.38.0-wmf.2) - 
10https://gerrit.wikimedia.org/r/724561 (https://phabricator.wikimedia.org/T291953) (owner: 10Lucas Werkmeister (WMDE)) [15:49:04] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:49:28] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 398 probes of 569 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:50:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:27] (03CR) 10jerkins-bot: [V: 04-1] icinga: add recheck_failed_services function [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [15:53:35] 10SRE, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10fgiunchedi) [15:53:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:31] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 569 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:54:52] (03PS1) 10Filippo Giunchedi: icinga: remove alertmanager::alerts [puppet] - 10https://gerrit.wikimedia.org/r/724771 (https://phabricator.wikimedia.org/T288726) [15:55:44] (03CR) 10Ladsgroup: "oh nice 😍" [puppet] - 10https://gerrit.wikimedia.org/r/724771 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [15:57:31] (03CR) 10Volans: "Yes, why not, we could do that. Please add to the docstring of wait_for_optimal some additional info about the new behaviour too." [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [15:58:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host thumbor2006.codfw.wmnet [15:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:23] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet executed with errors: - thu... 
[15:58:25] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:00:43] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:03:15] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:03:37] RECOVERY - Device not healthy -SMART- on ms-be2036 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2036&var-datasource=codfw+prometheus/ops [16:09:10] (03CR) 10Michael Große: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724776 (owner: 10Michael Große) [16:11:22] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey) Quick note: I had to add `createNamespace: false` to my service's `helmDefaults` to avoid the following error: ` Error: namespaces is forbidden: User "revscoring-editquality-dep... 
[16:18:07] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:18:49] (03CR) 10Hnowlan: [V: 03+1] cassandra: use FQDN in CN name for future instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) (owner: 10Hnowlan) [16:19:53] (03Merged) 10jenkins-bot: Handle missing items in WikibaseClientSiteLinksForItemHandler [extensions/WikimediaBadges] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724560 (https://phabricator.wikimedia.org/T291953) (owner: 10Lucas Werkmeister (WMDE)) [16:20:03] (03Merged) 10jenkins-bot: Handle missing items in WikibaseClientSiteLinksForItemHandler [extensions/WikimediaBadges] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724561 (https://phabricator.wikimedia.org/T291953) (owner: 10Lucas Werkmeister (WMDE)) [16:20:21] ^ I’ll test those two backports and then sync them [16:22:15] yup, fix works, let’s sync [16:24:01] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.1/extensions/WikimediaBadges/: Backport: [[gerrit:724560|Handle missing items in WikibaseClientSiteLinksForItemHandler (T291953)]] (duration: 01m 10s) [16:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:07] T291953: TypeError: Return value of WikimediaBadges\WikibaseClientSiteLinksForItemHandler::getItem() must be an instance of Wikibase\DataModel\Entity\Item, null returned - https://phabricator.wikimedia.org/T291953 [16:25:21] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/WikimediaBadges/: Backport: [[gerrit:724561|Handle missing items in WikibaseClientSiteLinksForItemHandler (T291953)]] (duration: 01m 08s) [16:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:04] alright, 
I’m done, deployment server is free as far as I’m concerned [16:33:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:18] !log pt1979@cumin2002 START - Cookbook sre.experimental.reimage for host thumbor2006.codfw.wmnet [16:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:24] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet [16:37:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:49] (03PS1) 10Jdlrobson: Search header should be vertically centered, not top aligned(take 2) [skins/MinervaNeue] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724787 (https://phabricator.wikimedia.org/T292071) [16:40:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:40:35] !log migrate kubemaster2001 off ganeti2007 and to ganeti2008 due to memory starvation on ganeti2007 [16:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:42] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:40:53] mailman seems indeed down [16:41:03] Yep [16:41:16] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:41:32] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:41:33] Or not [16:41:35] Seems back [16:41:43] majavah: did something crash [16:41:58] no clue [16:42:41] !log start hbal -L -G row_A -X on ganeti01.svc.codfw.wmnet [16:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:05] (03PS11) 10Jdlrobson: Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [16:43:39] !log start hbal -L -G row_B -X on ganeti01.svc.codfw.wmnet . Rows C and D are fine [16:43:43] (03PS12) 10Jdlrobson: Use Wikimania's logo in a new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [16:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:54] jynus: ^ [16:44:10] ? [16:44:12] memory issues after all. should not happen for a while now that I am rebalancing the clusters [16:44:21] is this re:k8s? [16:44:24] or something else [16:44:25] yes [16:44:35] ah, I thought something else was going on :-) [16:44:35] kubemaster was on a node starved of memory [16:46:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10cmooney) Thanks for the time in the meeting today to discuss. From our chat and a few other things I've looked at we can say: - T... [16:47:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:40] do you know if there is something from mailman on ganeti? there was a glitch a few minutes ago [16:47:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host thumbor2006.codfw.wmnet [16:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:55] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by pt1979@cumin2002 for host thumbor2006.codfw.wmnet completed: - thumbor2006 (*... [16:48:09] majavah: I think mailman node is on ganeti01 [16:48:09] I will check otherwise myself [16:48:13] lists1001 is a vm, but I don't think codfw rebalance should affect eqiad [16:48:17] ah, that would explain it [16:48:45] let's see some graphs [16:51:16] akosiaris, in case it could be useful for ganeti, we have a non-paging memory alert for dbs, ready to add if necessary [16:51:56] (03CR) 10Eevans: "Forgive me if this is a stupid question, but the special-casing of 2023-{a,b,c} is because that host already has certs that use the FQDN? " [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) (owner: 10Hnowlan) [16:52:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:25] (03CR) 10Hnowlan: [V: 03+1] cassandra: use FQDN in CN name for future instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) (owner: 10Hnowlan) [16:59:25] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:00:48] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul) [17:00:59] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul) This is ready for service [17:01:15] (03CR) 10Eevans: "To the degree I understand this, it LGTM. 😊 I don't have access to the labs/private repo, and I didn't realize we created key material and" [labs/private] - 10https://gerrit.wikimedia.org/r/724717 (owner: 10Hnowlan) [17:02:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10aborrero) Pretty much agree with everything you commented @cmooney Just a couple of clarifications: * the servers primary hostname... [17:02:19] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:03:07] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH) [17:03:09] (03CR) 10Eevans: [C: 03+1] cassandra: use FQDN in CN name for future instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) (owner: 10Hnowlan) [17:03:21] (03CR) 10Juan90264: [C: 03+1] "Thanks for grouping them together, now only a +2 Code-Review is missing. The other change I going to abandon." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [17:03:51] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH) [17:04:13] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH) [17:05:23] (03Abandoned) 10Juan90264: Add optimised square logo and wordmark for Wikimania on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [17:09:05] jouncebot: nowandnext [17:09:05] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [17:09:06] In 0 hour(s) and 50 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1800) [17:09:06] In 0 hour(s) and 50 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1800) [17:09:09] cool [17:09:14] (03CR) 10Ladsgroup: [C: 03+2] Fully enable change dispatching via jobs on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724776 (owner: 10Michael Große) [17:09:56] (03CR) 10Hnowlan: 
secrets: Clean up restbase stub certificates (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/724717 (owner: 10Hnowlan) [17:10:05] (03Merged) 10jenkins-bot: Fully enable change dispatching via jobs on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724776 (owner: 10Michael Große) [17:13:12] !log ladsgroup@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:724776|Fully enable change dispatching via jobs on test wikis]], Part I (duration: 01m 07s) [17:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:42] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724776|Fully enable change dispatching via jobs on test wikis]], Part I (duration: 01m 09s) [17:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:13] (03PS1) 10Bartosz Dziewoński: Add a link to preferences within the Reply and New Discussion Tools [extensions/DiscussionTools] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724788 (https://phabricator.wikimedia.org/T291002) [17:26:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:40] (03PS2) 10Bartosz Dziewoński: Add a link to preferences within the Reply and New Discussion Tools [extensions/DiscussionTools] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724788 (https://phabricator.wikimedia.org/T291002) [17:31:35] (03PS1) 10Bartosz Dziewoński: Add a link to preferences within the Reply and New Discussion Tools [extensions/DiscussionTools] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724789 (https://phabricator.wikimedia.org/T291002) [17:31:53] (03PS2) 10Bartosz Dziewoński: Add a link to preferences within the Reply and New Discussion Tools [extensions/DiscussionTools] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724789 (https://phabricator.wikimedia.org/T291002) [17:40:30] (03PS2) 10Ppchelko: Clean up temporary variable wgMathUseRestBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710126 (https://phabricator.wikimedia.org/T274436) [17:46:04] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP-wmf for erayfield - https://phabricator.wikimedia.org/T291126 (10ERayfield) wikitech.wikimedia.org - EllenR wikimedia developer accountUsername: ERayfield (WMF) [17:50:24] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) Seems like we have success :) Port is now up and MAC address learnt: ` cmooney@asw-a-codfw> show ethernet-switching table | match 1/... 
[17:55:58] (03CR) 10DLynch: [C: 03+1] "I guess the weirdest outcome might be if minor languages aren't translated by the time the train's cut, people might see it in their langu" [extensions/DiscussionTools] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724788 (https://phabricator.wikimedia.org/T291002) (owner: 10Bartosz Dziewoński) [17:59:08] * legoktm is live-hacking on mwdebug1001 [18:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1800) [18:00:05] ottomata, Jdlrobson, and Pchelolo: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:05] jeena and dduvall: How many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1800). [18:01:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10cmooney) Ok great @aborrero thanks for clarifying. That all 100% fits what I had in mind, so we are on the same page. I'll discus... [18:02:03] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:14] present for backport window [18:06:17] I got bounced out of IRC so I'm not sure who is running the backport window today? @ottomata / @Pchelolo is it either of you? [18:06:44] no one spoke up yet :) [18:07:27] okay thanks for clarifying @MatmaRex . The calendar needs a big update. 
I'm pretty sure @Niharika doesn't deploy any more :) [18:07:38] oh hello, i'm here [18:07:53] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for ksiebert - https://phabricator.wikimedia.org/T292053 (10Dzahn) Hi and welcome @KSiebert, you indicate you already do have shell access but while I can see your user KSiebert in LDAP I cannot see it in the shell access section. Is that maybe a different us... [18:08:02] i wasn't running it, i can do config changes but not backports so easily [18:08:20] Jdlrobson: is no one around to run it? [18:08:30] @MatmaRex , so no one has been stupid enough to admit they're the deployer yet ;) Evolution in progress [18:10:08] thcipriani: are you available? Also it looks like you maintain https://wikitech.wikimedia.org/wiki/User:DeploymentCalendarTool so can we drop Niharika's name from the templates as an available deployer (and possibly find a replacement? [18:10:10] I guess...we should ask RelEng to fix the calendar? [18:10:36] 10SRE, 10SRE-swift-storage, 10ops-codfw: swift - ms-be2035 - device sdi:6 unavailable - https://phabricator.wikimedia.org/T291896 (10Dzahn) Icinga all green [18:11:44] 10SRE, 10SRE-swift-storage, 10ops-codfw: swift - ms-be2036 - device sdg:4 unavailable - https://phabricator.wikimedia.org/T291988 (10Dzahn) Icinga all green >>! In T291988#7386881, @Joe wrote: > Nit sure why the alert only fired yesterday though, the bad sector errors have been ongoing for quite some time.... [18:11:47] the link to the git repo where patches should be sent is at the top of the Deployment calendar [18:12:27] legoktm: sure, but I'm not sure what the process is to identify new deployers [18:17:04] thanks for the ticket Jdlrobson [18:19:12] No worries. 
[18:19:18] The only time sensitive one is https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/724787 as it's a train blocker [18:20:37] (03CR) 10Dzahn: profile::gitlab start using gitlab module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [18:21:40] Jdlrobson: dancy is available [18:21:57] (03CR) 10Dzahn: profile::gitlab start using gitlab module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [18:22:20] Jdlrobson: I can deploy https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/724787 for you if it's ready to go. [18:23:10] (03CR) 10Dzahn: profile::gitlab start using gitlab module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [18:23:34] (03CR) 10Dzahn: profile::gitlab start using gitlab module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [18:24:21] (03CR) 10Eevans: [C: 03+1] secrets: Clean up restbase stub certificates (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/724717 (owner: 10Hnowlan) [18:24:41] (03CR) 10Bstorm: [C: 03+2] "This seems like a really good idea for modernizing the stack for node folks. I can try to squeeze in rebuilding the images in question tod" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/720788 (owner: 10Addshore) [18:25:19] (03Merged) 10jenkins-bot: Add yarn to node images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/720788 (owner: 10Addshore) [18:25:51] Thanks dancy! I'm checking in from the triage meeting but I can also deploy if needed. [18:26:46] according to the comment on the train blocker task the patch is ready [18:27:05] ok. hitting +2 [18:27:09] (03CR) 10Bstorm: [C: 03+1] "Sorry about the long wait on review or this. 
This would be great to at least try out." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: 10Majavah) [18:27:11] (03CR) 10Ahmon Dancy: [C: 03+2] Search header should be vertically centered, not top aligned(take 2) [skins/MinervaNeue] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724787 (https://phabricator.wikimedia.org/T292071) (owner: 10Jdlrobson) [18:27:16] (03CR) 10Dzahn: profile::gitlab start using gitlab module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [18:28:26] dancy: my extensions patches should be 100% safe to deploy whenever you are ready (and if you have time). to be safe (and because I have some meetings coming up), hold of on my config patch, and I can take care of that one later. I'll remove my config patch from the calendar [18:28:42] ok [18:28:54] ty [18:28:55] :) [18:30:11] thanks dancy [18:30:22] I can hang around to verify the fix [18:30:31] Ok. Will you be able to verify once it is on mwdebug? 
[18:31:43] yep [18:33:13] (03CR) 10Brennen Bearnes: profile::gitlab start using gitlab module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [18:34:00] (03CR) 10Dzahn: profile::gitlab start using gitlab module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [18:34:54] (03CR) 10Bstorm: [C: 03+1] Identify when venvs are for wrong Python versions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/723761 (https://phabricator.wikimedia.org/T276626) (owner: 10Majavah) [18:36:15] (03CR) 10Dzahn: [C: 03+1] "Hey Brennen, I am wondering about the status here because the linked ticket says it is resolved but also "defunct" while this patch is not" [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes) [18:37:38] (03PS6) 10Dzahn: puppetmaster::geoip: install additional maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) [18:38:08] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::geoip: install additional maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [18:38:21] (03CR) 10Dzahn: "Why does jerkins hate?" [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [18:38:43] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10Papaul) @wiki_willy bad news, mw2280 and the 10 decom'd servers all have same main board the only difference is the connector on the cable that connects the power button to the main board so we... 
[18:39:59] (03PS7) 10Dzahn: puppetmaster::geoip: install additional maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) [18:40:40] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::geoip: install additional maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [18:40:54] (03CR) 10Ssingh: cache::haproxy: Manage request/response headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [18:41:25] (03PS1) 10Herron: syslog::centralserver: run mtail service as group 'ops' [puppet] - 10https://gerrit.wikimedia.org/r/724813 (https://phabricator.wikimedia.org/T292051) [18:41:36] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10wiki_willy) Ok, thanks for checking on that @Papaul. Before ordering a new motherboard, maybe the next step is to check with Service-Ops if they can afford to decom this server. Thanks, Willy [18:41:59] (03PS2) 10Herron: syslog::centralserver: run mtail service as group 'ops' [puppet] - 10https://gerrit.wikimedia.org/r/724813 (https://phabricator.wikimedia.org/T292051) [18:42:21] (03CR) 10Brennen Bearnes: dev-images: migrate repository to gitlab remote (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes) [18:42:42] 10SRE, 10ops-codfw, 10DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Papaul) 05Open→03Resolved @Marostegui I checked the server today, all looks good . Resolving this task for now if we see the problem again we can re-open. thanks . 
[18:43:17] (03CR) 10Dzahn: [C: 03+1] dev-images: migrate repository to gitlab remote (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes) [18:44:03] (03CR) 10Dzahn: "@Jelto this was +2ed but not submitted, I am submitting it now." [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes) [18:44:37] (03CR) 10Dzahn: "I am surprised that puppet did not fight with you over the remote... running the agent" [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes) [18:45:04] (03Merged) 10jenkins-bot: Search header should be vertically centered, not top aligned(take 2) [skins/MinervaNeue] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724787 (https://phabricator.wikimedia.org/T292071) (owner: 10Jdlrobson) [18:45:11] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724813 (https://phabricator.wikimedia.org/T292051) (owner: 10Herron) [18:47:07] Jdlrobson: deployed to mwdebug1002 [18:47:08] (03CR) 10Dzahn: "yep, merged on master and no puppet change on contint2001. this repo is configured differently from others, it does not auto-submit after " [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes) [18:47:13] dancy: looking [18:50:48] (03CR) 10Brennen Bearnes: dev-images: migrate repository to gitlab remote (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes) [18:50:50] (03PS1) 10Jgleeson: ssh: Include custom sshd_config files. [puppet] - 10https://gerrit.wikimedia.org/r/724816 [18:51:06] dancy: LGTM. Please sync. [18:51:39] button pressed... 
[18:52:34] !log dancy@deploy1002 Synchronized php-1.38.0-wmf.2/skins/MinervaNeue/resources/skins.minerva.base.styles/ui.less: Backport: [[gerrit:724787|Search header should be vertically centered, not top aligned(take 2) (T292071)]] (duration: 01m 08s) [18:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:41] T292071: [Visual bug] Mobile search header is misaligned - https://phabricator.wikimedia.org/T292071 [18:52:49] (03PS1) 10Legoktm: Fix passing temp directory to EasyTimeline.pl [extensions/timeline] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724790 [18:53:05] Jdlrobson: Deployed [18:54:00] I also have one backport I'd like to sneak in [18:54:16] (03PS1) 10RobH: updating skus and requirements [software] - 10https://gerrit.wikimedia.org/r/724817 [18:54:17] Go for it. jeena ^^ FYI [18:54:46] (03CR) 10RobH: [C: 03+2] updating skus and requirements [software] - 10https://gerrit.wikimedia.org/r/724817 (owner: 10RobH) [18:55:17] (03Merged) 10jenkins-bot: updating skus and requirements [software] - 10https://gerrit.wikimedia.org/r/724817 (owner: 10RobH) [18:55:19] Thanks dancy [18:55:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:31] okay [18:56:46] I will wait for legoktm [18:57:05] (03PS2) 10Legoktm: Fix passing temp directory to EasyTimeline.pl [extensions/timeline] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724790 [18:57:10] (03CR) 10Legoktm: [C: 03+2] Fix passing temp directory to EasyTimeline.pl [extensions/timeline] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724790 (owner: 10Legoktm) [18:57:12] (03PS1) 10Ryan Kemper: wcqs: add oauth dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/724818 [18:59:29] (03Abandoned) 10Ryan Kemper: wcqs: add oauth dummy key [labs/private] - 10https://gerrit.wikimedia.org/r/724818 (owner: 10Ryan Kemper) [18:59:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] jeena and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1900). [19:01:38] (03Merged) 10jenkins-bot: Fix passing temp directory to EasyTimeline.pl [extensions/timeline] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724790 (owner: 10Legoktm) [19:02:21] testing on mwdebug100 [19:02:23] 1 [19:04:49] (03PS1) 10Ryan Kemper: wcqs: enable oauth [puppet] - 10https://gerrit.wikimedia.org/r/724821 (https://phabricator.wikimedia.org/T280006) [19:05:16] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724821 (https://phabricator.wikimedia.org/T280006) (owner: 10Ryan Kemper) [19:06:01] (03CR) 10Ppchelko: "This change is ready for review." 
[core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724791 (https://phabricator.wikimedia.org/T291202) (owner: 10Ppchelko) [19:06:14] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/timeline/scripts/renderTimeline.sh: Fix passing temp directory to EasyTimeline.pl (duration: 01m 07s) [19:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:26] it's still not working, not sure why exactly since I have it fine locally, but synced it since it fixes part of it. [19:06:35] I'll figure it out in a bit [19:06:37] jeena: all yours [19:07:00] thanks legoktm [19:07:29] jeena: o/ [19:07:30] (03CR) 10Herron: [C: 03+2] syslog::centralserver: run mtail service as group 'ops' [puppet] - 10https://gerrit.wikimedia.org/r/724813 (https://phabricator.wikimedia.org/T292051) (owner: 10Herron) [19:08:10] I think we might want to merge https://gerrit.wikimedia.org/r/724791 first [19:09:11] I'm checking [19:13:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:50] (03CR) 10Ssingh: cache::haproxy: Configure sslcert::ocsp (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719471 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [19:15:24] dduvall: waiting to be able to merge that patch and then I'll proceed with the train [19:16:50] (03PS1) 10Ppchelko: Revert "Drop i18n messages for removed token API" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724792 [19:17:09] (03PS3) 10Ppchelko: Revert "Drop action api token methods deprecated in 1.24" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724791 (https://phabricator.wikimedia.org/T291202) [19:17:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:59] jeena: roger [19:25:05] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:27:11] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:33:42] (03PS1) 10Herron: warn on idle mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) [19:37:57] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "Drop action api token methods deprecated in 1.24" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724791 (https://phabricator.wikimedia.org/T291202) (owner: 10Ppchelko) [19:38:41] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "Drop i18n messages for removed token API" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724792 (owner: 10Ppchelko) [19:40:42] (03CR) 10Ejegg: [C: 03+1] "Looks good! This seems to be in the default distro-provided sshd_config so there should be no reason not to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/724816 (owner: 10Jgleeson) [19:50:40] (03PS1) 10Ebernhardson: query_service: Split oauth secret from other settings [puppet] - 10https://gerrit.wikimedia.org/r/724829 [19:50:42] (03PS1) 10Ebernhardson: query_service: Parameterize url redirected to after oauth success [puppet] - 10https://gerrit.wikimedia.org/r/724830 [19:51:58] (03PS8) 10Dzahn: puppetmaster::geoip: install additional maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) [19:52:32] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::geoip: install additional maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [19:52:50] (03PS1) 10Ebernhardson: query_service: Split consumer secret out of oauth_settings [labs/private] - 10https://gerrit.wikimedia.org/r/724831 [19:52:56] (03PS1) 10Ebernhardson: query_service: Remove non-secret values from secrets repo [labs/private] - 10https://gerrit.wikimedia.org/r/724832 [19:56:40] (03PS2) 10Ebernhardson: query_service: Split consumer secret out of oauth_settings [labs/private] - 10https://gerrit.wikimedia.org/r/724831 [19:56:46] (03PS2) 10Ebernhardson: query_service: Remove non-secret values from secrets repo [labs/private] - 10https://gerrit.wikimedia.org/r/724832 [19:57:13] (03CR) 10Ebernhardson: "Likely needs https://gerrit.wikimedia.org/r/c/labs/private/+/724831 first for the new key. 
I've already added the new key to wdqspuppet so" [puppet] - 10https://gerrit.wikimedia.org/r/724829 (owner: 10Ebernhardson) [19:57:29] (03Merged) 10jenkins-bot: Revert "Drop i18n messages for removed token API" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724792 (owner: 10Ppchelko) [19:57:48] (03PS9) 10Dzahn: puppetmaster::geoip: install additional maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) [19:58:14] (03Merged) 10jenkins-bot: Revert "Drop action api token methods deprecated in 1.24" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724791 (https://phabricator.wikimedia.org/T291202) (owner: 10Ppchelko) [20:00:04] jeena and dduvall: That opportune time is upon us again. Time for a MediaWiki train - American Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T1900). [20:00:04] chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T2000). 
[20:02:49] !log jhuneidi@deploy1002 Started scap: Fix pywikibot feature detection [20:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:15] (03PS10) 10Dzahn: puppetmaster::geoip: refactor to allow installing maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) [20:07:18] (03PS2) 10Ebernhardson: query_service: Parameterize url redirected to after oauth success [puppet] - 10https://gerrit.wikimedia.org/r/724830 [20:07:48] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::geoip: refactor to allow installing maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [20:08:03] (03CR) 10jerkins-bot: [V: 04-1] query_service: Parameterize url redirected to after oauth success [puppet] - 10https://gerrit.wikimedia.org/r/724830 (owner: 10Ebernhardson) [20:09:10] (03PS11) 10Dzahn: puppetmaster::geoip: refactor to allow installing maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) [20:09:49] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::geoip: refactor to allow installing maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [20:12:22] (03PS2) 10Ebernhardson: query_service: Split oauth secret from other settings [puppet] - 10https://gerrit.wikimedia.org/r/724829 [20:12:24] (03PS3) 10Ebernhardson: query_service: Parameterize url redirected to after oauth success [puppet] - 10https://gerrit.wikimedia.org/r/724830 [20:13:11] (03CR) 10jerkins-bot: [V: 04-1] query_service: Parameterize url redirected to after oauth success [puppet] - 10https://gerrit.wikimedia.org/r/724830 (owner: 10Ebernhardson) [20:13:21] (03PS3) 10Ryan Kemper: query_service: Split oauth secret from settings [puppet] - 10https://gerrit.wikimedia.org/r/724829 
(https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [20:16:27] !log jhuneidi@deploy1002 Finished scap: Fix pywikibot feature detection (duration: 13m 38s) [20:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:55] deploying to group1 now [20:16:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:11] (03PS3) 10Bearloga: statistics::product_analytics: create and prepare [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) [20:18:38] (03CR) 10Bearloga: statistics::product_analytics: create and prepare (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) (owner: 10Bearloga) [20:19:11] (03PS1) 10Jeena Huneidi: group1 wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724833 [20:19:13] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724833 (owner: 10Jeena Huneidi) [20:20:01] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724833 (owner: 10Jeena Huneidi) [20:20:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:32] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.2 refs T281166 [20:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:38] T281166: 1.38.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T281166 [20:22:40] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.2 refs T281166 (duration: 01m 08s) [20:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:21] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [20:24:44] 10SRE-Access-Requests: Request access to superset for ifried - https://phabricator.wikimedia.org/T292118 (10ifried) [20:25:42] 10SRE, 10ops-codfw, 10DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Marostegui) Thank you Papaul [20:25:50] (03CR) 10Herron: [C: 03+1] Prefer mx2001 for mail in ulsfo/eqsin [puppet] - 10https://gerrit.wikimedia.org/r/724338 (owner: 10Muehlenhoff) [20:27:09] 10SRE-Access-Requests: Request access to superset for ifried - https://phabricator.wikimedia.org/T292118 (10ifried) [20:28:45] 10SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (10ifried) [20:28:55] (03PS1) 10Dduvall: Merge branch 'master' into train-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724834 [20:29:18] (03PS1) 10Bstorm: d/changelog: Prepare for 0.77 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/724836 [20:30:25] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:30:41] 
(03PS4) 10Bearloga: statistics::product_analytics: create and prepare [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) [20:30:42] jeena: so far so good [20:30:53] yeah I think so too [20:31:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:05] 10SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (10ifried) [20:32:33] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:33:36] (03PS2) 10Bstorm: d/changelog: Prepare for 0.77 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/724836 [20:33:44] (03PS4) 10Ebernhardson: query_service: Parameterize url redirected to after oauth success [puppet] - 10https://gerrit.wikimedia.org/r/724830 (https://phabricator.wikimedia.org/T280006) [20:34:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:53] (03PS4) 10Ryan Kemper: query_service: Split oauth secret from settings [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [20:41:02] (03PS3) 10Ryan Kemper: query_service: Split consumer secret out of oauth_settings [labs/private] - 10https://gerrit.wikimedia.org/r/724831 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [20:41:45] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] query_service: Split consumer secret out of oauth_settings [labs/private] - 10https://gerrit.wikimedia.org/r/724831 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [20:42:31] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:42:43] (03PS5) 10Ryan Kemper: query_service: Split oauth secret from settings [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [20:43:01] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [20:44:39] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:45:58] (03PS3) 10Ryan Kemper: query_service: Remove non-secret values from secrets repo [labs/private] - 10https://gerrit.wikimedia.org/r/724832 (owner: 10Ebernhardson) [20:48:30] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [20:50:35] (03PS1) 10Dzahn: thanos: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/724838 (https://phabricator.wikimedia.org/T266479) [20:51:49] I'm once again live hacking on mwdebug1001 [20:53:40] (03PS1) 10Dzahn: alertmanager: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/724839 (https://phabricator.wikimedia.org/T266479) [20:54:10] (03CR) 10jerkins-bot: [V: 04-1] alertmanager: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/724839 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [20:54:15] (03PS4) 10Ryan Kemper: query_service: Remove non-secret values from secrets repo [labs/private] - 10https://gerrit.wikimedia.org/r/724832 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [20:56:52] (03PS1) 10Dzahn: debdeploy/base: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/724841 (https://phabricator.wikimedia.org/T266479) [21:00:25] just an update: the timeline fix from earlier *did* appear to fix everything, I was probably just looking at a cached copy [21:01:09] (03PS12) 10Dzahn: puppetmaster::geoip: refactor to allow installing maxmind databases for IP Info [puppet] - 10https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) [21:01:40] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::geoip: refactor to allow installing maxmind databases for IP Info [puppet] - 
https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) (owner: Dzahn) [21:03:29] (CR) Michael DiPietro: "If I'm seeing correctly we copy some bits of the git log into the changelog file before an update. Where is this used that git wouldn't be" [software/tools-webservice] - https://gerrit.wikimedia.org/r/724836 (owner: Bstorm) [21:03:47] (PS13) Dzahn: puppetmaster::geoip: refactor to allow installing maxmind databases for IP Info [puppet] - https://gerrit.wikimedia.org/r/723337 (https://phabricator.wikimedia.org/T288844) [21:04:40] (PS2) Dzahn: alertmanager: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/724839 (https://phabricator.wikimedia.org/T266479) [21:06:23] SRE, SRE-Access-Requests: Request access to private data group for ifried - https://phabricator.wikimedia.org/T292118 (Iflorez) [21:06:38] (CR) Bstorm: d/changelog: Prepare for 0.77 release (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/724836 (owner: Bstorm) [21:09:00] (CR) Bstorm: "Since I use a Mac, I'm actually running "gbp dch --git-author..." using a function that calls a docker container, which Bryan and I docume" [software/tools-webservice] - https://gerrit.wikimedia.org/r/724836 (owner: Bstorm) [21:13:57] (CR) Michael DiPietro: [C: +1] "Awesome. Thank you for the information." [software/tools-webservice] - https://gerrit.wikimedia.org/r/724836 (owner: Bstorm) [21:28:50] (PS1) Legoktm: Bump CACHE_VERSION for ffa2ac0be55 [extensions/timeline] (wmf/1.38.0-wmf.2) - https://gerrit.wikimedia.org/r/724794 [21:29:34] (CR) Legoktm: [C: +2] Bump CACHE_VERSION for ffa2ac0be55 [extensions/timeline] (wmf/1.38.0-wmf.2) - https://gerrit.wikimedia.org/r/724794 (owner: Legoktm) [21:30:20] (CR) Bstorm: [C: +2] "Alright. I'm going to merge this and try a toolsbeta release."
[software/tools-webservice] - https://gerrit.wikimedia.org/r/724836 (owner: Bstorm) [21:31:34] legoktm: https://phabricator.wikimedia.org/T292126 [21:32:29] * legoktm looks [21:34:16] (Merged) jenkins-bot: Bump CACHE_VERSION for ffa2ac0be55 [extensions/timeline] (wmf/1.38.0-wmf.2) - https://gerrit.wikimedia.org/r/724794 (owner: Legoktm) [21:36:38] (Merged) jenkins-bot: d/changelog: Prepare for 0.77 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/724836 (owner: Bstorm) [21:37:05] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/timeline/includes/Timeline.php: Bump Timeline::CACHE_VERSION (duration: 01m 08s) [21:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:44:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:18] (PS1) Legoktm: Catch TimelineException from fixMap() [extensions/timeline] (wmf/1.38.0-wmf.2) - https://gerrit.wikimedia.org/r/724797 (https://phabricator.wikimedia.org/T292126) [21:50:24] (CR) Legoktm: [C: +2] Catch TimelineException from fixMap() [extensions/timeline] (wmf/1.38.0-wmf.2) - https://gerrit.wikimedia.org/r/724797 (https://phabricator.wikimedia.org/T292126) (owner: Legoktm) [21:54:59] (Merged) jenkins-bot: Catch TimelineException from fixMap() [extensions/timeline] (wmf/1.38.0-wmf.2) - https://gerrit.wikimedia.org/r/724797 (https://phabricator.wikimedia.org/T292126) (owner: Legoktm) [21:57:24] !log legoktm@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/timeline/includes/Timeline.php: Catch TimelineException from fixMap() (T292126) (duration: 01m 07s) [21:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:31] T292126: TimelineException: timeline-invalidmap - https://phabricator.wikimedia.org/T292126 [21:58:47] jeena: ^ should be fixed now [21:59:48] thanks legoktm [22:07:13] (PS1) Dduvall: Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/724857 [22:07:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:07:29] (Abandoned) Dduvall: Merge branch 'master' into train-dev [mediawiki-config] - https://gerrit.wikimedia.org/r/724834 (owner: Dduvall) [22:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:27] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/724829 (https://phabricator.wikimedia.org/T280006) (owner: Ebernhardson) [22:10:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[22:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:05] (PS1) Jdlrobson: Restore original more menu padding in legacy Vector [skins/Vector] (wmf/1.38.0-wmf.2) - https://gerrit.wikimedia.org/r/724798 (https://phabricator.wikimedia.org/T289163) [22:14:37] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:22:29] PROBLEM - SSH on ms-fe2006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:29:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:31:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:45:18] (PS1) BryanDavis: toolhub: Bump container version to 2021-09-29-223524-production [deployment-charts] - https://gerrit.wikimedia.org/r/724859 (https://phabricator.wikimedia.org/T292027) [22:55:03] (CR) BryanDavis: [C: +2] toolhub: Bump container version to 2021-09-29-223524-production [deployment-charts] - https://gerrit.wikimedia.org/r/724859 (https://phabricator.wikimedia.org/T292027) (owner: BryanDavis) [22:56:52] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: session-187.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:31] RECOVERY - snapshot of s4 in codfw on alert1001 is OK: Last snapshot for s4 at codfw (db2139.codfw.wmnet:3314) taken on 2021-09-29 21:19:45 (1532 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [22:58:56] (Merged) jenkins-bot: toolhub: Bump container version to 2021-09-29-223524-production [deployment-charts] - https://gerrit.wikimedia.org/r/724859 (https://phabricator.wikimedia.org/T292027) (owner: BryanDavis) [23:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210929T2300). [23:00:05] No Gerrit patches in the queue for this window AFAICS. [23:02:48] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' .
[23:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:20] (PS1) Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [23:05:43] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . [23:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:51] PROBLEM - very high load average likely xfs on ms-be2035 is CRITICAL: CRITICAL - load average: 106.27, 101.56, 93.89 https://wikitech.wikimedia.org/wiki/Swift [23:19:58] RECOVERY - Check systemd state on ms-be2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:09] !log bd808@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . [23:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:39] (PS2) Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [23:23:12] (CR) jerkins-bot: [V: -1] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [23:23:25] RECOVERY - SSH on ms-fe2006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:28:19] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: session-200.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:32] (CR) Ahmon Dancy: [C: +2] Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/724857 (owner: Dduvall) [23:33:57] (PS3) Dzahn: geoip:
create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [23:34:26] (CR) jerkins-bot: [V: -1] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [23:34:48] (Merged) jenkins-bot: Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/724857 (owner: Dduvall) [23:35:54] (CR) Cwhite: [C: +1] "Overall, LGTM. Some nits inline." [alerts] - https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) (owner: Herron) [23:37:33] PROBLEM - LVS shellbox-syntaxhighlight eqiad port 4014/tcp - Shellbox SyntaxHighlight- shellbox-syntaxhighlight.svc.eqiad.wmnet IPv4 on shellbox-syntaxhighlight.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [23:38:44] (PS4) Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [23:39:14] (CR) jerkins-bot: [V: -1] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [23:39:33] RECOVERY - LVS shellbox-syntaxhighlight eqiad port 4014/tcp - Shellbox SyntaxHighlight- shellbox-syntaxhighlight.svc.eqiad.wmnet IPv4 on shellbox-syntaxhighlight.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 358 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [23:42:47] (PS5) Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [23:43:23] (CR) jerkins-bot: [V: -1] geoip: create
transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [23:45:19] I'll look into the shellbox alert in a bit [23:46:09] thanks [23:46:52] Seems to have been a bigger spike in requests than I was expecting [23:46:57] https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-syntaxhighlight&var-release=main [23:47:07] I'm going to add more replicas shortly [23:50:56] (PS6) Dzahn: geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) [23:51:37] (CR) jerkins-bot: [V: -1] geoip: create transitional class geoip::data::maxmind::ipinfo [puppet] - https://gerrit.wikimedia.org/r/724860 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [23:53:41] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook