[00:06:42] PROBLEM - Check systemd state on ms-be2044 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:53:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:01:18] RECOVERY - Check systemd state on ms-be2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:14:54] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:15:58] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:25:50] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:17:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2044-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [03:30:14] !log run optimize table on db2140 for image table (T296143) [03:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:19] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [03:57:54] PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [04:00:04] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [04:13:34] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:26:08] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms [04:27:54] RECOVERY - ElasticSearch shard size check - 9200 on logstash2002 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [06:22:19] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Marostegui) No more network flaps, so going to start repooling. [06:22:45] (03PS1) 10Marostegui: db1131: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/740384 (https://phabricator.wikimedia.org/T295952) [06:23:47] (03CR) 10Marostegui: [C: 03+2] db1131: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/740384 (https://phabricator.wikimedia.org/T295952) (owner: 10Marostegui) [06:24:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 1%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17785 and previous config saved to /var/cache/conftool/dbconfig/20211122-062455-root.json [06:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:36] (03PS1) 10Marostegui: control-mariadb-10.4*: Update version [software] - 10https://gerrit.wikimedia.org/r/740386 (https://phabricator.wikimedia.org/T295970) [06:31:09] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.4*: Update version [software] - 10https://gerrit.wikimedia.org/r/740386 (https://phabricator.wikimedia.org/T295970) (owner: 10Marostegui) [06:32:15] (03Merged) 10jenkins-bot: control-mariadb-10.4*: Update version 
[software] - 10https://gerrit.wikimedia.org/r/740386 (https://phabricator.wikimedia.org/T295970) (owner: 10Marostegui) [06:39:49] (03CR) 10Marostegui: "Thanks for sending this patch!" [puppet] - 10https://gerrit.wikimedia.org/r/739667 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [06:39:51] (03CR) 10Marostegui: [C: 03+2] mariadb: remove all grants related to scholarship app and its dumps [puppet] - 10https://gerrit.wikimedia.org/r/739667 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [06:39:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 5%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17786 and previous config saved to /var/cache/conftool/dbconfig/20211122-063959-root.json [06:40:01] (03PS2) 10Marostegui: mariadb: remove all grants related to scholarship app and its dumps [puppet] - 10https://gerrit.wikimedia.org/r/739667 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [06:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:19] !log Revoke dump grants for scholarships database T296166 [06:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:24] T296166: Remove scholarships grants from m2 - https://phabricator.wikimedia.org/T296166 [06:55:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 10%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17787 and previous config saved to /var/cache/conftool/dbconfig/20211122-065502-root.json [06:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 20%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17788 and previous config saved to /var/cache/conftool/dbconfig/20211122-071006-root.json [07:10:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2044-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [07:17:56] !log running optimize table on image table in commonswiki on codfw with replication enabled, it'll cause replication lag (T296143) [07:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:00] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [07:25:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17789 and previous config saved to /var/cache/conftool/dbconfig/20211122-072511-root.json [07:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance T296143 [07:27:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance T296143 [07:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:38] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [07:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2073.codfw.wmnet with reason: Maintenance T296143 [07:27:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2073.codfw.wmnet with reason: Maintenance T296143 [07:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2090.codfw.wmnet with reason: Maintenance T296143 [07:28:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2090.codfw.wmnet with reason: Maintenance T296143 [07:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance T296143 [07:28:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance T296143 [07:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance T296143 [07:28:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance T296143 [07:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance T296143 [07:28:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance T296143 [07:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:37] 
!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance T296143 [07:28:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance T296143 [07:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance T296143 [07:28:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance T296143 [07:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance T296143 [07:28:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance T296143 [07:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance T296143 [07:28:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance T296143 [07:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:31] good morning Amir1 [07:29:32] :D [07:29:38] :D [07:30:01] I should 
use some grouping I guess [07:30:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance T296143 [07:30:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance T296143 [07:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:11] Amir1: pro-tip, the downtime cookbook (like various others) accepts any cumin query as a selector for the hosts to act on ;) [07:33:41] yeah, I was trying to come up with a query for s4 hosts [07:33:55] not all of codfw dbs [07:34:51] I'll ask Manuel once he's back [07:37:44] worst case you can always use db20[01,12,56-78]* (random hosts) [07:38:19] (sorry on mobile can't easily grep puppet ;) ) [07:39:07] oh I didn't know it works like that too. Thanks! [07:39:18] that's fine [07:40:14] https://wikitech.wikimedia.org/wiki/Cumin#PuppetDB_host_selection [07:40:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 40%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17790 and previous config saved to /var/cache/conftool/dbconfig/20211122-074015-root.json [07:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:50] thanks. 
I missed the [] part when I was reading it [07:44:07] (03CR) 10Muehlenhoff: [C: 03+2] Show cluster name in conformation dialogue, not the master's name [cookbooks] - 10https://gerrit.wikimedia.org/r/740187 (owner: 10Muehlenhoff) [07:45:03] (03PS5) 10Muehlenhoff: wikimedia.org: add ldap-rw to replace ldap-labs [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [07:52:59] (03PS3) 10Giuseppe Lavagetto: Add apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/736273 (https://phabricator.wikimedia.org/T289224) [07:53:44] (03PS1) 10Elukey: profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) [07:54:21] (03CR) 10jerkins-bot: [V: 04-1] profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [07:55:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17791 and previous config saved to /var/cache/conftool/dbconfig/20211122-075518-root.json [07:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:26] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Majavah) >>! In T296125#7518549, @AlexisJazz wrote: >>>! In T296125#75183... 
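The bracket host selection mentioned at 07:37:44 (db20[01,12,56-78]) can be illustrated with a small sketch. This is not Cumin's own code: `expand_hosts` is a hypothetical helper that only handles a single bracket group of zero-padded numbers and ranges, and it ignores the trailing `*` glob, which needs a host inventory to resolve.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket group such as 'db20[01,12,56-78]' into host names.

    Hypothetical helper for illustration; real selectors support far more.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket group: treat as one literal host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a single number is a range of length one
        width = len(start)  # keep zero padding, so '01' stays two digits
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


hosts = expand_hosts("db20[01,12,56-78]")
print(len(hosts), hosts[:3], hosts[-1])
```

The example pattern from the chat expands to db2001, db2012 and db2056 through db2078, so one selector can replace the per-host cookbook runs logged at 07:27-07:30.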
[07:59:28] (03PS2) 10Elukey: profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) [08:01:00] (03CR) 10jerkins-bot: [V: 04-1] profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:01:02] (03CR) 10Majavah: profile::base::certificates: deploy wmf-certificates only in prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:02:25] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I agree with what this patch is supposed to do, but I have a doubt about the implementation." [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:02:31] (03CR) 10Elukey: profile::base::certificates: deploy wmf-certificates only in prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:04:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/736273 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto) [08:07:09] _joe_: \o/ [08:07:42] <_joe_> majavah: I hope to have it operating by this evening or tomorrow [08:08:04] <_joe_> majavah: I also hope this is the shortest living service we'll ever create :P [08:08:13] <_joe_> as I really want to retire that thing [08:10:13] (03Merged) 10jenkins-bot: Add apple-search deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/736273 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto) [08:10:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17792 and previous config saved to 
/var/cache/conftool/dbconfig/20211122-081022-root.json [08:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:46] (03CR) 10ArielGlenn: "To save you a bit of pain and extra work, I would ask you to wait until the move from crons to systemd timers is a bit more complete, and " [puppet] - 10https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966) (owner: 10Urbanecm) [08:15:28] (03PS3) 10Elukey: profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) [08:15:32] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apple-search' for release 'main' . [08:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:17] (03CR) 10Muehlenhoff: [C: 03+2] wikimedia.org: add ldap-rw to replace ldap-labs [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [08:17:57] (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966) (owner: 10Urbanecm) [08:19:32] (03PS4) 10Elukey: profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) [08:20:41] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32536/console" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:21:16] (03PS1) 10Giuseppe Lavagetto: apple-search: fix image reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/740520 [08:23:37] (03PS5) 10Elukey: profile::base::certificates: deploy wmf-certificates only in prod [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) [08:24:52] (03CR) 
10Urbanecm: [DNM] snapshot: Dump information about Growth mentorship (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966) (owner: 10Urbanecm) [08:24:54] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32537/console" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:25:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: Repool after HW maintenance', diff saved to https://phabricator.wikimedia.org/P17793 and previous config saved to /var/cache/conftool/dbconfig/20211122-082525-root.json [08:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:17] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Marostegui) 05Open→03Resolved Host repooled - thanks DCOps for the fast response! 
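The db1131 repool logged between 06:24 and 08:25 went through nine steps roughly 15 minutes apart. A minimal sketch of that schedule follows, assuming a fixed 15-minute interval and a rounded 06:25 start; the percentages come from the dbctl log entries, while `repool_schedule` is a hypothetical name and the code does not call dbctl.

```python
from datetime import datetime, timedelta

# Pooling percentages taken from the dbctl commit messages in the log.
STEPS = [1, 5, 10, 20, 25, 40, 50, 75, 100]


def repool_schedule(start: datetime, interval: timedelta, steps=STEPS):
    """Return (time, percentage) pairs, one pooling step per interval."""
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


# Assumed start time and spacing; the real commits drift by a few seconds.
schedule = repool_schedule(datetime(2021, 11, 22, 6, 25), timedelta(minutes=15))
for when, pct in schedule:
    print(f"{when:%H:%M} pool db1131 at {pct}%")
```

Under these assumptions the last step lands at 08:25, which matches the 100% commit logged at 08:25:26.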
[08:28:03] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32538/console" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:28:52] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32539/console" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:29:40] (03PS1) 10Urbanecm: ApiSetMentorStatus: Use READ_LATEST to request back timestamp [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740355 (https://phabricator.wikimedia.org/T295305) [08:30:02] (03CR) 10Urbanecm: "all the other fixes for M2 module are in wmf.9, makes sense for this patch to be there too" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740355 (https://phabricator.wikimedia.org/T295305) (owner: 10Urbanecm) [08:31:25] (03CR) 10Urbanecm: [C: 03+2] "deploying before train eventually rolls forward to be able to sync this more easily" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740355 (https://phabricator.wikimedia.org/T295305) (owner: 10Urbanecm) [08:31:48] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apple-search' for release 'main' . 
[08:31:50] (03CR) 10Elukey: [V: 03+1] profile::base::certificates: deploy wmf-certificates only in prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [08:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] apple-search: fix image reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/740520 (owner: 10Giuseppe Lavagetto) [08:40:47] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "unreasonable CI complaints, passed in master" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740355 (https://phabricator.wikimedia.org/T295305) (owner: 10Urbanecm) [08:41:34] (03Merged) 10jenkins-bot: apple-search: fix image reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/740520 (owner: 10Giuseppe Lavagetto) [08:42:31] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apple-search' for release 'main' . 
[08:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:21] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/: 4418c4367b7420139cd8b30cb003d697b58c618f: ApiSetMentorStatus: Use READ_LATEST to request back timestamp (T295305) (duration: 01m 08s) [08:44:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1125.eqiad.wmnet with OS bullseye [08:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:25] T295305: Mentor tools: In production/beta, submitting the away dialog causes a JavaScript error - https://phabricator.wikimedia.org/T295305 [08:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:17] (03CR) 10Jelto: [C: 03+2] gitlab: turn on Content-Security-Policy [puppet] - 10https://gerrit.wikimedia.org/r/737968 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [08:49:23] !log drain ganeti-test2003 for forthcoming reimage [08:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:19] <_joe_> uhm can't access gerrit via ssh [08:56:39] _joe_: works for me [08:56:53] <_joe_> ipv6 or ipv4? 
[08:57:07] ipv4 [08:57:21] <_joe_> yeah I can't either way it seems [08:57:22] works for me on both protocols [08:57:25] * _joe_ perplexed [08:57:59] _joe_: turn off and on your windows firewall [08:58:34] <_joe_> uhm nevermind, my problem is reaching bast3005 via ipv6 [08:58:48] <_joe_> now fixed [08:58:58] (03PS1) 10Giuseppe Lavagetto: apple-search: fix the private files location [deployment-charts] - 10https://gerrit.wikimedia.org/r/740522 [09:00:08] (03PS1) 10Giuseppe Lavagetto: image-suggestion: fix the private files position [deployment-charts] - 10https://gerrit.wikimedia.org/r/740523 [09:00:42] (03PS1) 10Majavah: Revert "Revert "dynamicproxy: add keystone token verification"" [puppet] - 10https://gerrit.wikimedia.org/r/740524 [09:01:32] good morning [09:05:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:56] !log installing 4.19.208-1 kernels on Stretch hosts with 4.19 kernels [09:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:30] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1125.eqiad.wmnet with OS bullseye [09:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:38] I have volunteered to co-drive the train since our American colleagues celebrate Thanksgiving this week (and are thus off Thursday/Friday) [09:08:43] will look at the blockers [09:08:50] (03CR) 10Mbch331: Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [09:11:38] (03PS1) 10Urbanecm: Growth: Disable filtering by unstarred mentees at enwiki, fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740525 (https://phabricator.wikimedia.org/T293182) [09:12:18] (03PS2) 10Urbanecm: 
Growth: Disable filtering by unstarred mentees at enwiki, fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740525 (https://phabricator.wikimedia.org/T293182) [09:12:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:32] * urbanecm quietly sneaks in 740525 [09:12:37] (03CR) 10Urbanecm: [C: 03+2] Growth: Disable filtering by unstarred mentees at enwiki, fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740525 (https://phabricator.wikimedia.org/T293182) (owner: 10Urbanecm) [09:12:46] we don't have VisualEditor on testwiki do we? [09:12:56] hashar: why shouldn't we? [09:13:05] VE is definitely at testwikis [09:13:08] I can't find the ve edit action in the web ui ;) [09:13:24] (03Merged) 10jenkins-bot: Growth: Disable filtering by unstarred mentees at enwiki, fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740525 (https://phabricator.wikimedia.org/T293182) (owner: 10Urbanecm) [09:13:26] "edit source", pen icon at the top right corner of the editor [09:13:43] as majavah says [09:13:46] majavah: that made it: [09:13:47] ! 
[09:13:51] or disable single edit tab in your preferences [09:13:59] I was kind of expecting "Edit" and "Edit source" as top tabs [09:14:01] (by default, it shows whatever you used the last time IIRC) [09:14:09] see https://www.mediawiki.org/wiki/VisualEditor/Single_edit_tab [09:14:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1125.eqiad.wmnet with OS bullseye [09:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:12] indeed I had "remember my last editor" [09:15:54] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 24b3a7769ca97e3ed951d77d911f41afae5e4136: Growth: Disable filtering by unstarred mentees at arwiki, enwiki, fawiki (T293182) (duration: 01m 04s) [09:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:59] T293182: Expectation (readQueryRows <=) 10000 by MediaWiki::main not met (actual: 20736): query-m: SELECT gemm_mentee_id AS `value` FROM `growthexperiments_mentor_mentee` WHERE gemm_mentor_id = N - https://phabricator.wikimedia.org/T293182 [09:16:02] * urbanecm done with touching production, for now [09:17:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] apple-search: fix the private files location [deployment-charts] - 10https://gerrit.wikimedia.org/r/740522 (owner: 10Giuseppe Lavagetto) [09:17:45] I have marked the VisualEditor blocker solved [09:21:38] (03Merged) 10jenkins-bot: apple-search: fix the private files location [deployment-charts] - 10https://gerrit.wikimedia.org/r/740522 (owner: 10Giuseppe Lavagetto) [09:23:25] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:24:17] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'apple-search' for release 'main' . 
[09:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:29] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:29:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:34] (03PS1) 10Filippo Giunchedi: install_server: fix partman config for new prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/740531 (https://phabricator.wikimedia.org/T294302) [09:34:40] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10fgiunchedi) Thanks @Papaul for taking care of this! I've noticed a typo in netboot.cfg for prometheus which I think... [09:35:29] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:35:55] !log installing Linux 4.9.272 updates on Stretch hosts [09:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:37] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:38:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:38:49] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1125.eqiad.wmnet with OS bullseye [09:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:35] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apple-search' for release 'main' . [09:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:51] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:42:34] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [09:44:11] (03CR) 10ArielGlenn: [DNM] snapshot: Dump information about Growth mentorship (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966) (owner: 10Urbanecm) [09:45:33] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apple-search' for release 'main' . 
[09:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:50] (03CR) 10ArielGlenn: snapshot: replace the word cron everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [09:47:52] (03CR) 10Btullis: [C: 03+2] Update the way that the unavailable druid segment alert works [alerts] - 10https://gerrit.wikimedia.org/r/740128 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [09:51:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one note inline." [puppet] - 10https://gerrit.wikimedia.org/r/740531 (https://phabricator.wikimedia.org/T294302) (owner: 10Filippo Giunchedi) [09:53:14] (03CR) 10Ayounsi: [C: 03+1] sites: add new kubestage nodes [homer/public] - 10https://gerrit.wikimedia.org/r/739879 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth) [09:55:09] (03PS1) 10Ayounsi: Revert "prepend_as_out for esams/knams" [homer/public] - 10https://gerrit.wikimedia.org/r/740356 [10:01:35] (03CR) 10Marostegui: "Could we merge this sooner rather than later? 
Looks like reimages are broken at the moment 😊" [puppet] - 10https://gerrit.wikimedia.org/r/740531 (https://phabricator.wikimedia.org/T294302) (owner: 10Filippo Giunchedi) [10:04:28] (03PS2) 10Filippo Giunchedi: install_server: fix partman config for new prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/740531 (https://phabricator.wikimedia.org/T294302) [10:04:56] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: fix partman config for new prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740531 (https://phabricator.wikimedia.org/T294302) (owner: 10Filippo Giunchedi) [10:05:07] marostegui: will merge as soon as jenkins says yes [10:05:14] godog: \o/ thanks [10:05:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/740531 (https://phabricator.wikimedia.org/T294302) (owner: 10Filippo Giunchedi) [10:05:38] sure np [10:07:21] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:08:49] marostegui: all done [10:08:56] taking a break, brb [10:10:22] godog: did you run puppet on install servers? otherwise I'll do it now [10:11:41] godog: thanks! [10:12:23] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: No response from remote host 208.80.154.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:12:24] marostegui: puppet run completed on install servers, you can retry db1125 [10:12:31] moritzm: thank you! 
[10:12:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1125.eqiad.wmnet with OS bullseye [10:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:23] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active, ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:16:03] (03PS1) 10Arturo Borrero Gonzalez: cloud: cinder-backups: fix configuration values [puppet] - 10https://gerrit.wikimedia.org/r/740535 (https://phabricator.wikimedia.org/T295584) [10:16:07] !log restart snmp gracefully cr2-eqord [10:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:39] moritzm: it went thru this time :) thank you [10:17:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:17:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:19:47] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:19:47] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:20:00] XioNoX: topranks: fyi ^^^ [10:20:02] (03PS1) 10Ayounsi: add `^omnibot/` to bad UAs [puppet] - 10https://gerrit.wikimedia.org/r/740536 [10:20:04] 302 _sec [10:20:15] yeah see _security [10:20:24] ack reading now [10:20:39] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "PCC: https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/32540/console" [puppet] - 10https://gerrit.wikimedia.org/r/740535 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [10:20:43] moritzm: I did yeah re: puppet on install servers [10:21:01] * volans here [10:22:21] 
(03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/740536 (owner: 10Ayounsi) [10:22:35] (03PS1) 10Btullis: Change the analytics-hive CNAME to use the standby server [dns] - 10https://gerrit.wikimedia.org/r/740537 (https://phabricator.wikimedia.org/T295673) [10:22:37] (03CR) 10Ayounsi: [C: 03+2] add `^omnibot/` to bad UAs [puppet] - 10https://gerrit.wikimedia.org/r/740536 (owner: 10Ayounsi) [10:22:46] (Primary outbound port utilisation over 80% #page) firing: (2) Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:22:46] (Primary outbound port utilisation over 80% #page) firing: (2) Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:22:54] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/740536 (owner: 10Ayounsi) [10:22:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "We need a better message for ombibot, but let's first extinguish the fire" [puppet] - 10https://gerrit.wikimedia.org/r/740536 (owner: 10Ayounsi) [10:25:32] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1125.eqiad.wmnet with OS bullseye [10:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:25] (03PS11) 10David Caro: Add Typing: And fix other minor lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [10:27:27] (03PS1) 10David Caro: WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) [10:27:36] (03PS1) 10Arturo Borrero Gonzalez: ceph: update default version to octopus [puppet] - 10https://gerrit.wikimedia.org/r/740540 [10:28:57] (03CR) 10David Caro: WIP cli: add --fail-fast flag and behavior (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [10:29:03] 
PROBLEM - BGP status on cr2-eqord is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:30:10] RECOVERY - BGP status on cr2-eqord is OK: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:30:17] (03CR) 10David Caro: WIP cli: add --fail-fast flag and behavior (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [10:30:19] (03PS1) 10Marostegui: packages_wmf.pp: Add bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/740541 (https://phabricator.wikimedia.org/T295965) [10:30:58] (03PS2) 10Arturo Borrero Gonzalez: ceph: update default version to octopus [puppet] - 10https://gerrit.wikimedia.org/r/740540 (https://phabricator.wikimedia.org/T296175) [10:31:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/740541 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [10:32:12] (03CR) 10Marostegui: "<3" [puppet] - 10https://gerrit.wikimedia.org/r/740541 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [10:32:15] (03CR) 10Marostegui: [C: 03+2] packages_wmf.pp: Add bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/740541 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [10:33:18] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:33:47] (03PS1) 10Ema: varnish: match omnibot anywhere in UA, add contact info [puppet] - 10https://gerrit.wikimedia.org/r/740542 [10:34:30] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - No response from remote host 
208.80.154.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:34:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] varnish: match omnibot anywhere in UA, add contact info [puppet] - 10https://gerrit.wikimedia.org/r/740542 (owner: 10Ema) [10:35:37] (03CR) 10David Caro: "Just a question, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/740540 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez) [10:35:41] (03CR) 10Jbond: [C: 03+1] "LGTM but" [puppet] - 10https://gerrit.wikimedia.org/r/740542 (owner: 10Ema) [10:36:14] <_joe_> ema: hold a sec [10:36:19] <_joe_> I'm doing a check [10:36:23] ack [10:36:30] <_joe_> ema: done, you can proceed [10:36:44] <_joe_> all hits in our logs for "omnibot" are for the offender [10:36:49] jbond: you gave +1 and added "LGTM but", any concern? [10:38:00] (03CR) 10jerkins-bot: [V: 04-1] WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [10:39:00] RECOVERY - BGP status on cr2-eqord is OK: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:39:22] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:40:16] <_joe_> ema: I'd say merge for now, we're in a partial outage [10:41:03] (03CR) 10Elukey: [C: 03+1] Change the analytics-hive CNAME to use the standby server [dns] - 10https://gerrit.wikimedia.org/r/740537 (https://phabricator.wikimedia.org/T295673) (owner: 10Btullis) [10:41:12] (03CR) 10Ema: [C: 03+2] varnish: match omnibot anywhere in UA, add contact info [puppet] - 10https://gerrit.wikimedia.org/r/740542 (owner: 10Ema) [10:41:22] ema: ahh saw here the "but" was a typo please ignore [10:41:52] ack merging [10:41:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2003.codfw.wmnet with OS buster [10:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:06] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2003.codfw.wmnet with OS buster [10:42:07] (03PS3) 10Arturo Borrero Gonzalez: ceph: update default version to octopus [puppet] - 10https://gerrit.wikimedia.org/r/740540 (https://phabricator.wikimedia.org/T296175) [10:43:08] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:43:18] (03PS4) 10Arturo Borrero Gonzalez: ceph: update default version to octopus [puppet] - 10https://gerrit.wikimedia.org/r/740540 (https://phabricator.wikimedia.org/T296175) [10:43:32] (03CR) 10Btullis: [C: 03+2] Change the analytics-hive CNAME to use the standby server [dns] - 10https://gerrit.wikimedia.org/r/740537 
(https://phabricator.wikimedia.org/T295673) (owner: 10Btullis) [10:45:44] (03CR) 10David Caro: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/740540 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez) [10:45:55] (03CR) 10Arturo Borrero Gonzalez: ceph: update default version to octopus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740540 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez) [10:45:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1125.eqiad.wmnet with OS bullseye [10:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:08] PROBLEM - Disk space on sodium is CRITICAL: DISK CRITICAL - free space: /boot 9 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sodium&var-datasource=eqiad+prometheus/ops [10:46:38] <_joe_> can someone look at sodium? [10:47:21] I can [10:47:46] (Primary outbound port utilisation over 80% #page) firing: (2) Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:47:46] (Primary outbound port utilisation over 80% #page) firing: (2) Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:49:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ceph: update default version to octopus [puppet] - 10https://gerrit.wikimedia.org/r/740540 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez) [10:49:42] !log `apt-get remove linux-image-4.9.0-5-amd64 linux-image-4.9.0-6-amd64` on sodium to free /boot [10:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:23] moritzm: o/ is it ok if I clean up sodium's boot even more? [10:52:04] elukey: sure, please go ahead! [10:52:05] (sodium's boot is at 76% usage now, the alert should recover soon-ish) [10:52:08] ack thanks! 
[10:52:31] (03CR) 10Hashar: [C: 03+2] "Preparation for the morning backport window." [skins/MinervaNeue] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740246 (https://phabricator.wikimedia.org/T296077) (owner: 10Nray) [10:52:44] anything except linux-image-4.9.0-16-amd64 and the currently running kernel can go away [10:52:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:52:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:53:23] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:54:20] moritzm: purged up to -14 [10:54:47] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:54:47] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [10:54:51] !log apt-get purge up to linux-image-4.9.0-14-amd64 on sodium to free /boot space [10:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:10] (03PS1) 10Jbond: public_cloud: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740545 [11:00:21] elukey: thx [11:00:33] (03PS1) 10JMeybohm: Reimplement hook to no longer call update-ca-certificates [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740547 (https://phabricator.wikimedia.org/T296127) [11:00:35] (03PS1) 10JMeybohm: Bump debian/changelog [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740548 [11:02:27] (03PS2) 10Jbond: public_cloud: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740545 [11:06:01] (03PS3) 10Jbond: public_cloud: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740545 [11:06:09] RECOVERY - Disk space 
on sodium is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sodium&var-datasource=eqiad+prometheus/ops [11:08:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32546/console" [puppet] - 10https://gerrit.wikimedia.org/r/740545 (owner: 10Jbond) [11:09:32] (03PS4) 10Jbond: public_cloud: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740545 [11:09:39] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.00 ms [11:10:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test2003.codfw.wmnet with OS buster [11:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:21] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2003.codfw.wmnet with OS buster completed: - ganeti-test2... 
[11:12:38] !log Revert "prepend_as_out for esams/knams" [11:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:46] (03CR) 10Ayounsi: [C: 03+2] Revert "prepend_as_out for esams/knams" [homer/public] - 10https://gerrit.wikimedia.org/r/740356 (owner: 10Ayounsi) [11:13:09] (03Merged) 10jenkins-bot: Fix banners to show CentralNotice [skins/MinervaNeue] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740246 (https://phabricator.wikimedia.org/T296077) (owner: 10Nray) [11:13:29] (03Merged) 10jenkins-bot: Revert "prepend_as_out for esams/knams" [homer/public] - 10https://gerrit.wikimedia.org/r/740356 (owner: 10Ayounsi) [11:16:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1125.eqiad.wmnet with OS bullseye [11:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:38] (03PS1) 10Ayounsi: Revert "disable LG ipv4 in knams" [homer/public] - 10https://gerrit.wikimedia.org/r/740361 [11:17:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2044-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [11:17:54] (03CR) 10Ayounsi: [C: 03+2] Revert "disable LG ipv4 in knams" [homer/public] - 10https://gerrit.wikimedia.org/r/740361 (owner: 10Ayounsi) [11:18:38] (03Merged) 10jenkins-bot: Revert "disable LG ipv4 in knams" [homer/public] - 10https://gerrit.wikimedia.org/r/740361 (owner: 10Ayounsi) [11:20:14] !log re-enable LibertyGlobal in esams [11:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [11:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for 
release 'pinkunicorn' . [11:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:00] (03PS1) 10Giuseppe Lavagetto: Add apple-search VIPs [dns] - 10https://gerrit.wikimedia.org/r/740550 [11:24:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [11:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:47] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:25:49] (03CR) 10Volans: [C: 03+1] "LGTM, don't forget to run also the sre.dns.netbox cookbook ;)" [dns] - 10https://gerrit.wikimedia.org/r/740550 (owner: 10Giuseppe Lavagetto) [11:26:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2003.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [11:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:19] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation={get,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:26:42] (03PS1) 10Arturo Borrero Gonzalez: cloud: cinder-backups: use main ceph cinder keyring [puppet] - 10https://gerrit.wikimedia.org/r/740551 (https://phabricator.wikimedia.org/T292546) [11:26:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti-test2003.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [11:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:33] RECOVERY - etcd 
request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:28:21] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:32:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add apple-search VIPs [dns] - 10https://gerrit.wikimedia.org/r/740550 (owner: 10Giuseppe Lavagetto) [11:34:32] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Restarting to pick up Java security updates - hnowlan@cumin1001 [11:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:50] !log oblivian@cumin1001 START - Cookbook sre.dns.netbox [11:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:23] (03PS5) 10Jbond: public_cloud: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740545 [11:39:11] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:57] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:40:06] (03PS6) 10Jbond: public_cloud: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740545 [11:41:05] !log oblivian@cumin1001 START - Cookbook sre.dns.netbox [11:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32548/console" [puppet] - 10https://gerrit.wikimedia.org/r/740545 (owner: 10Jbond) [11:43:07] !log 
installing krb5 security updates on stretch [11:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:08] !log oblivian@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:55] !log installing qemu security updates on bullseye [11:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:48] (03PS1) 10Btullis: Failback hive services to the designated master [dns] - 10https://gerrit.wikimedia.org/r/740552 (https://phabricator.wikimedia.org/T295673) [11:57:59] (03PS1) 10Arturo Borrero Gonzalez: cinder: fix config template and don't reuse 'ceph_pool' that much [puppet] - 10https://gerrit.wikimedia.org/r/740554 (https://phabricator.wikimedia.org/T292546) [11:58:34] (03CR) 10jerkins-bot: [V: 04-1] cinder: fix config template and don't reuse 'ceph_pool' that much [puppet] - 10https://gerrit.wikimedia.org/r/740554 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [11:59:33] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211122T1200). [12:00:04] inductiveload, James_F, and Hashar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:15] o/ [12:00:23] \o [12:00:24] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [12:00:35] * James_F waves. 
[12:00:56] hi, I got a patch for MinervaNeue which is already merged, I will deploy it after you all are done with patches ;) [12:01:43] looks like inductiveload’s changes are backports, so they’ll need a few minutes in CI; maybe it would make sense if you deploy your already merged patch right away, hashar? [12:01:44] actually if we're still rolled-back, some of my patches may not apply [12:01:54] sure [12:01:56] oh, true, we are [12:02:03] only group0 is on wmf.9 [12:02:16] ok, so maybe just do 740216 which doesn't need .9 [12:02:19] I plan to push wmf.9 to rest of wikis this afternoon [12:02:32] You're meant to make the cherry-picks for the deployer. [12:03:31] oh right [12:03:43] sorry, I have only done config patches till now [12:03:56] so what branch should this be on? [12:04:01] So you want them for wmf.8 and wmf.9, I think. [12:04:14] Well, all for wmf.9 and it sounds like only one for wmf.8 [12:04:25] And they might depend on each other so they'll need to be stacked? [12:04:36] *wmf.7 and wmf.9 [12:04:40] wmf.8 never got deployed [12:04:44] Yeah, sorry. [12:05:17] ok, let me just do the perf one, and I'll rejig the .9 ones and get them ready for tmr [12:05:21] if they are not big issues, feel free to skip wmf.7 [12:05:36] sounds good inductiveload [12:05:47] I like the LinkBatch one [12:07:05] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10MoritzMuehlenhoff) The following upgrade steps were taken towards 2.16. After going through the upgrade again, it turns out the update procedure attempted in t... [12:07:15] hashar: are you deploying the MinervaNeue patch?
(I don’t see you logged in on deploy1002) [12:07:23] I will [12:07:29] ok [12:07:35] then I’ll wait for that [12:08:10] (03PS3) 10Hnowlan: partmon: add reuse partmon profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) [12:08:32] scap pull on mwdebug1001 [12:08:35] testing [12:08:47] ah, now I see you in w/who too ^^ [12:09:13] for sure I see the banner now ;) [12:09:35] \o/ [12:09:44] (03PS1) 10Inductiveload: Lua: use LinkBatch to speed up the template dependencies [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740556 (https://phabricator.wikimedia.org/T296092) [12:10:01] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/740556 [12:10:04] (03CR) 10Btullis: [C: 03+2] Failback hive services to the designated master [dns] - 10https://gerrit.wikimedia.org/r/740552 (https://phabricator.wikimedia.org/T295673) (owner: 10Btullis) [12:10:04] like that? [12:10:12] [ERROR] "scap sync" has been renamed to "scap sync-world". 
[12:10:13] pfff [12:10:34] nooo don’t sync-world D: [12:11:25] !log hashar@deploy1002 Synchronized php-1.38.0-wmf.9/skins/MinervaNeue: Fix banners to show CentralNotice - T296077 (duration: 01m 04s) [12:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:29] T296077: CentralNotice banners not showing in Minerva - https://phabricator.wikimedia.org/T296077 [12:11:44] inductiveload: looks good (if you want it on wmf.7 as well you’ll need another backport) [12:11:53] sure one mo [12:12:10] (03PS1) 10Inductiveload: Lua: use LinkBatch to speed up the template dependencies [extensions/ProofreadPage] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/740558 (https://phabricator.wikimedia.org/T296092) [12:12:13] I am done with the MinervaNeue backport (success) [12:12:19] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/740558 [12:12:20] for the .7 [12:12:44] alright, let’s merge .9 first [12:12:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Lua: use LinkBatch to speed up the template dependencies [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740556 (https://phabricator.wikimedia.org/T296092) (owner: 10Inductiveload) [12:13:04] and I will push wmf.9 to all wikis this afternoon, sometime shortly after the backport window [12:13:25] Zuul predicts the merge will take 21 minutes… [12:13:38] :-\ [12:13:40] James_F: do you want to deploy your config change in the meantime? [12:13:45] (ExtensionDistributor) [12:16:00] Lucas_WMDE: Sure. [12:16:07] alright, go ahead :) [12:16:11] Ack. 
[12:16:29] (03PS3) 10Jforrester: ExtensionDistributor: 1.37.0 is out now, so there's no beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739861 (https://phabricator.wikimedia.org/T289585) [12:16:34] (03CR) 10Jforrester: [C: 03+2] ExtensionDistributor: 1.37.0 is out now, so there's no beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739861 (https://phabricator.wikimedia.org/T289585) (owner: 10Jforrester) [12:17:01] Also Beta's HTTPS cert having expired is unhelpful. :-) [12:17:09] yeah :( [12:17:22] (03Merged) 10jenkins-bot: ExtensionDistributor: 1.37.0 is out now, so there's no beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739861 (https://phabricator.wikimedia.org/T289585) (owner: 10Jforrester) [12:17:34] Lucas_WMDE: BTW, crossing the streams a bit, but: `curl 185.15.56.74:6927/_info` [12:18:33] neat :) [12:18:44] no rDNS but mtr looks like it’s in wmcloud at least ^^ [12:18:51] Yeah, it's on deployment-prep. [12:18:53] inductiveload: it looks like this doGetIndexProgress Lua thing is only used on bnwikisource,enwikisource,hiwikisource? (according to mwgrep) [12:19:01] Specifically, deployment-docker-wikifunctions01. [12:19:03] so I guess the wmf.9 backport isn’t really testable, actually [12:19:22] because only angwikisource and htwikisource are in group0 [12:19:46] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: ExtensionDistributor: 1.37.0 is out now, so there's no beta T289585 (duration: 01m 04s) [12:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:50] T289585: Release MW 1.37.0 - https://phabricator.wikimedia.org/T289585 [12:19:52] do .7 first and I can try on enWS? 
[12:20:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Lua: use LinkBatch to speed up the template dependencies [extensions/ProofreadPage] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/740558 (https://phabricator.wikimedia.org/T296092) (owner: 10Inductiveload) [12:20:16] alright [12:20:33] (I'm all done in prod.) [12:20:37] and then wmf.9 will become more relevant once the train rolls forward [12:20:39] ack, thans [12:20:41] *thanks [12:20:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:01] (03PS2) 10Arturo Borrero Gonzalez: cinder: fix config template and don't reuse 'ceph_pool' that much [puppet] - 10https://gerrit.wikimedia.org/r/740554 (https://phabricator.wikimedia.org/T292546) [12:22:17] (03PS3) 10Arturo Borrero Gonzalez: cinder: fix config template and don't reuse 'ceph_pool' that much [puppet] - 10https://gerrit.wikimedia.org/r/740554 (https://phabricator.wikimedia.org/T292546) [12:23:02] lunch, will be back later for the train [12:23:08] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Lucas_Werkmeister_WMDE) [12:24:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:24:52] (03PS1) 10Arturo Borrero Gonzalez: ceph: codfw: refresh entry name for codfw1dev-cinder-backups [labs/private] - 10https://gerrit.wikimedia.org/r/740562 (https://phabricator.wikimedia.org/T292546) [12:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:02] (03PS1) 10Inductiveload: Use the WikiEditor ready hook instead of using() the lib [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740563 (https://phabricator.wikimedia.org/T296033) [12:29:48] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:30:30] ^ already went down again on grafana AFAICT [12:31:14] (03Merged) 10jenkins-bot: Lua: use LinkBatch to speed up the template dependencies [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740556 (https://phabricator.wikimedia.org/T296092) (owner: 10Inductiveload) [12:31:16] (peaked at 3s) [12:31:28] alright, waiting for the other merge… [12:32:57] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] ceph: codfw: refresh entry name for codfw1dev-cinder-backups [labs/private] - 10https://gerrit.wikimedia.org/r/740562 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [12:34:56] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:35:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:34] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: backups: refresh entry for ceph keyring [labs/private] - 10https://gerrit.wikimedia.org/r/740564 (https://phabricator.wikimedia.org/T292546) [12:36:53] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] codfw1dev: backups: refresh entry for ceph keyring [labs/private] - 10https://gerrit.wikimedia.org/r/740564 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [12:38:15] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/32552/" [puppet] - 10https://gerrit.wikimedia.org/r/740554 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [12:38:49] (03Merged) 10jenkins-bot: Lua: use LinkBatch to speed up the template dependencies [extensions/ProofreadPage] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/740558 (https://phabricator.wikimedia.org/T296092) (owner: 10Inductiveload) [12:39:08] ayy \o/ [12:39:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:50] test on debug1002? [12:39:56] inductiveload: the wmf.7 change should be on mwdebug1001, can you test it? [12:40:11] (note, mwdebug1001 not 2) [12:40:16] righto [12:41:38] testing it myself, purging https://en.wikisource.org/wiki/Wikisource:Community_collaboration/Monthly_Challenge/November_2021 seems to be somewhat faster on mwdebug1001 than without x-wikimedia-debug [12:41:46] though it still takes a few seconds [12:41:57] yeah it's a big-ass page [12:43:01] the guy running the challenge this month has been throwing tons of stuff in [12:43:25] good to deploy? 
[12:44:08] it certainly looks faster to me [12:44:11] and it's not dead AFACT [12:44:16] s/AFACT/AFAICT/ [12:44:25] alright, starting sync [12:45:48] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.7/extensions/ProofreadPage/includes/Pagination/Pagination.php: Backport: [[gerrit:740558|Lua: use LinkBatch to speed up the template dependencies (T296092)]] (1/2) (duration: 01m 04s) [12:45:49] (03CR) 10Jhernandez: [C: 03+1] "I've tested the associative array config locally and I didn't see any issues in QuickSurveys, so it should work fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [12:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:53] T296092: ProofreadPage: improve performance of Index stats Lua - https://phabricator.wikimedia.org/T296092 [12:46:01] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10Papaul) @fgiunchedi Thanks for the fix. 
[12:46:04] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:47:14] thank you [12:47:16] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.7/extensions/ProofreadPage/includes/ProofreadPageLuaLibrary.php: Backport: [[gerrit:740558|Lua: use LinkBatch to speed up the template dependencies (T296092)]] (2/2) (duration: 01m 03s) [12:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:28] ok, and now wmf.9 [12:47:32] probably without further testing [12:48:08] sadly both ang and ht WS are closed [12:48:37] but I can keep an eye on it when the .9 goes out later [12:49:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:49:40] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/ProofreadPage/includes/Pagination/Pagination.php: Backport: [[gerrit:740556|Lua: use LinkBatch to speed up the template dependencies (T296092)]] (1/2) (duration: 01m 04s) [12:49:42] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:03] alright, thanks [12:50:19] and thanks for updating the deployment calendar btw :) [12:50:23] and we can defer the other .9 patches for now [12:50:27] :-) [12:51:03] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/ProofreadPage/includes/ProofreadPageLuaLibrary.php: Backport: [[gerrit:740556|Lua: use LinkBatch to speed up the template dependencies (T296092)]] (2/2) (duration: 01m 03s) [12:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:07] T296092: ProofreadPage: improve performance of Index stats Lua - https://phabricator.wikimedia.org/T296092 [12:51:28] alright, then I think we’re done [12:51:35] !log UTC morning backport+config window done [12:51:37] yes, thank you :-) [12:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:17] (03CR) 10Jbond: "I have given a quick first pass but will need another one. 
Also when this is ready it would be good to get another python reviewer as ii " [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro) [12:56:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740547 (https://phabricator.wikimedia.org/T296127) (owner: 10JMeybohm) [12:56:43] (03CR) 10Jbond: [C: 03+1] Bump debian/changelog [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740548 (owner: 10JMeybohm) [13:01:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:03:48] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:04:02] !log asw-b-codfw# set virtual-chassis member 7 mastership-priority 255 - T295118 [13:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:06] T295118: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 [13:05:53] Lucas_WMDE: well done! [13:06:37] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: deploy general cinder keyring in cinder-backups nodes [puppet] - 10https://gerrit.wikimedia.org/r/740579 (https://phabricator.wikimedia.org/T292546) [13:11:58] (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: deploy general cinder keyring in cinder-backups nodes [puppet] - 10https://gerrit.wikimedia.org/r/740579 (https://phabricator.wikimedia.org/T292546) [13:14:52] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10ayounsi) The above command doesn't commit on a pre-provisioned VC. 
I did this instead: ` [edit virtual-chassis member 2] - role routing-engine; +... [13:15:38] jouncebot: now [13:15:38] No deployments scheduled for the next 3 hour(s) and 14 minute(s) [13:15:43] 3.14 :D [13:17:07] nice ^^ [13:17:24] Amir1, marostegui, duesen, jynus, akosiaris, _joe_: I am going to roll 1.38.0-wmf.9 to all wikis. The database queries surge should be fixed ( T296063 ) so maybe the memory issue will not show up. [13:17:25] T296063: 4x increase in database queries after deploy of 1.38.0-wmf.9 to all wikis - https://phabricator.wikimedia.org/T296063 [13:17:30] fingers crossed [13:17:48] hashar: thanks for the heads up [13:17:49] hashar: good luck! [13:18:09] (03PS1) 10Hashar: all wikis to 1.38.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740581 [13:18:11] (03CR) 10Hashar: [C: 03+2] all wikis to 1.38.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740581 (owner: 10Hashar) [13:19:06] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740581 (owner: 10Hashar) [13:19:23] * hashar whistles [13:20:33] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.9 [13:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:13] hashar: I didn't come across an obvious explanation for the memory surge while investigating the DB query issue. Let's hope for the best [13:23:47] Is there a bot that reminds us to do whatever we do for good luck prior to deploying? :D [13:24:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus2005.codfw.wmnet with OS bullseye [13:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:14] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host prometheus2005.codfw.wmnet with OS bull... [13:25:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nice job!" [alerts] - 10https://gerrit.wikimedia.org/r/735669 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [13:26:13] !bash  Is there a bot that reminds us to do whatever we do for good luck prior to deploying? :D [13:26:13] Amir1: Stored quip at https://bash.toolforge.org/quip/WDbTR30B8Fs0LHO5Watc [13:27:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [13:28:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:14] (03PS1) 10Filippo Giunchedi: pontoon: fix puppetdb role name [puppet] - 10https://gerrit.wikimedia.org/r/740582 [13:29:40] (03CR) 10Filippo Giunchedi: "(catching up with backlog post-VAC) thank you for the heads up!" 
[puppet] - 10https://gerrit.wikimedia.org/r/701931 (https://phabricator.wikimedia.org/T285666) (owner: 10Jbond) [13:30:18] !log re-enabling V6 between cr2-codfw and asw-b-codfw - T295118 [13:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:21] T295118: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 [13:31:59] !log re-enable puppet on lvs2007 [13:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:54] !log re-enable pybal on lvs2007 - T295118 [13:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:30] RECOVERY - PyBal backends health check on lvs2007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:33:38] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:33:44] RECOVERY - pybal on lvs2007 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:35:02] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 104, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:35:28] PROBLEM - puppet last run on lvs2007 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:36:50] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:37:47] (Juniper alarm active) resolved: Juniper alarm active - https://alerts.wikimedia.org [13:38:12] RECOVERY - PyBal connections to etcd on lvs2007 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:38:56] RECOVERY - k8s API server requests latencies on kubemaster2002 
is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:38:56] 10SRE, 10ops-codfw, 10serviceops: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10akosiaris) We 've discussed this in last week's meeting. For now, it looks like we will not be replacing this hardware as it is not worth it. We will need howev... [13:42:04] (03PS1) 10Phuedx: beta: Grant sysops the ipinfo-view-basic right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740587 (https://phabricator.wikimedia.org/T295912) [13:43:00] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus2005.codfw.wmnet with OS bullseye [13:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:09] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host prometheus2005.codfw.wmnet with OS bullseye... 
[14:05:09] RECOVERY - puppet last run on lvs2007 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:05:21] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=parsoid,name=wtp1025.eqiad.wmnet [14:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:43] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=parsoid,name=wtp1041.eqiad.wmnet [14:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:09] !log repool wtp1025, wtp1041 to parsoid cluster. T296098 [14:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:12] T296098: 1.38.0-wmf.9 seems to have introduced a memory leak - https://phabricator.wikimedia.org/T296098 [14:07:13] (03CR) 10Elukey: [C: 03+1] "Tested the script, works nicely. Left a nit about inline documentation (but it could be me not caffeinated enough)." [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740547 (https://phabricator.wikimedia.org/T296127) (owner: 10JMeybohm) [14:07:38] (03CR) 10Elukey: [C: 03+1] Bump debian/changelog [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740548 (owner: 10JMeybohm) [14:10:08] jouncebot: now [14:10:08] No deployments scheduled for the next 2 hour(s) and 19 minute(s) [14:10:55] I'm going to merge and sync a Beta Cluster only change [14:11:10] phuedx: you don't need to sync it, just rebase it [14:11:34] the deployment to beta cluster itself is automated and happens every ten minutes (unless the build breaks) [14:14:10] Amir1: I'm never sure if I should scap-sync -labs.php-only changes. 
I've previously sync'ed them but happy to just rebase for now [14:15:12] whatever you prefer, it's just they are completely different worlds [14:15:20] (03PS2) 10JMeybohm: Reimplement hook to no longer call update-ca-certificates [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740547 (https://phabricator.wikimedia.org/T296127) [14:15:22] (03PS2) 10JMeybohm: Bump debian/changelog [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740548 [14:16:06] (03PS1) 10Andrew Bogott: wmf-sink (designate): fix a crash when we fail to find a DNS record for a VM [puppet] - 10https://gerrit.wikimedia.org/r/740592 (https://phabricator.wikimedia.org/T296144) [14:16:40] (03CR) 10JMeybohm: Reimplement hook to no longer call update-ca-certificates (031 comment) [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740547 (https://phabricator.wikimedia.org/T296127) (owner: 10JMeybohm) [14:17:45] (03CR) 10Phuedx: [C: 03+2] beta: Grant sysops the ipinfo-view-basic right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740587 (https://phabricator.wikimedia.org/T295912) (owner: 10Phuedx) [14:18:49] (03Merged) 10jenkins-bot: beta: Grant sysops the ipinfo-view-basic right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740587 (https://phabricator.wikimedia.org/T295912) (owner: 10Phuedx) [14:22:18] Alright. Rebased [14:23:48] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Restarting to pick up Java security updates - hnowlan@cumin1001 [14:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:46] (03CR) 10Andrew Bogott: [C: 03+2] wmf-sink (designate): fix a crash when we fail to find a DNS record for a VM [puppet] - 10https://gerrit.wikimedia.org/r/740592 (https://phabricator.wikimedia.org/T296144) (owner: 10Andrew Bogott) [14:28:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:09] oh joy I got a conflict in grafana editing :( [14:38:25] (03PS5) 10Majavah: opentack: add keystone auth to remaining proxy api users [puppet] - 10https://gerrit.wikimedia.org/r/740306 (https://phabricator.wikimedia.org/T295234) [14:39:34] (03CR) 10Andrew Bogott: [C: 03+2] opentack: add keystone auth to remaining proxy api users [puppet] - 10https://gerrit.wikimedia.org/r/740306 (https://phabricator.wikimedia.org/T295234) (owner: 10Majavah) [14:40:18] 10SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10Majavah) [14:43:53] jouncebot: nowandnext [14:43:53] No deployments scheduled for the next 1 hour(s) and 46 minute(s) [14:43:53] In 1 hour(s) and 46 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211122T1630) [14:43:57] noice [14:44:20] (03PS2) 10Ladsgroup: Disable DPL on Wikisources where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734424 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:44:28] Amir1: Go go go. [14:44:31] Death to DPL. [14:44:32] Etc. 
[14:44:36] (03CR) 10Ladsgroup: [C: 03+2] Disable DPL on Wikisources where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734424 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:44:43] :D [14:44:57] !log jelto@cumin1001 START - Cookbook sre.ganeti.makevm for new host gitlab-runner1001.wikimedia.org [14:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:10] (03PS1) 10Filippo Giunchedi: pontoon: tmp remove base::puppet for duplicate declaration? [puppet] - 10https://gerrit.wikimedia.org/r/740595 [14:46:12] (03PS1) 10Filippo Giunchedi: install_server: tweak reserved space for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/740596 (https://phabricator.wikimedia.org/T294302) [14:46:45] (03PS2) 10Filippo Giunchedi: install_server: tweak reserved space for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/740596 (https://phabricator.wikimedia.org/T294302) [14:46:55] (03Merged) 10jenkins-bot: Disable DPL on Wikisources where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734424 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:48:13] (03PS2) 10Ladsgroup: Disable DPL on Wikiversities where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734425 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:48:18] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: tweak reserved space for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/740596 (https://phabricator.wikimedia.org/T294302) (owner: 10Filippo Giunchedi) [14:48:20] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Revert "dynamicproxy: add keystone token verification"" [puppet] - 10https://gerrit.wikimedia.org/r/740524 (owner: 10Majavah) [14:48:22] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:734424|Disable DPL on Wikisources where not in use (T287916)]] (duration: 00m 56s) [14:48:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:26] T287916: Disable DPL on wikis that aren't using it - https://phabricator.wikimedia.org/T287916 [14:48:29] (03CR) 10Ladsgroup: [C: 03+2] Disable DPL on Wikiversities where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734425 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:48:36] andrewbogott: I'll merge your change too [14:48:41] thanks! [14:48:49] sure np, {{done}} [14:49:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:47] (03Merged) 10jenkins-bot: Disable DPL on Wikiversities where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734425 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:50:51] (03PS2) 10Ladsgroup: Disable DPL on opt-in wikis where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734426 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:51:36] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:734425|Disable DPL on Wikiversities where not in use (T287916)]] (duration: 00m 56s) [14:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:04] (03CR) 10Ladsgroup: [C: 03+2] Disable DPL on opt-in wikis where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734426 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:53:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:21] (03Merged) 10jenkins-bot: Disable DPL on opt-in wikis where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734426 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:54:37] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/740597 [14:54:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus2006.codfw.wmnet with OS bullseye [14:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host prometheus2006.co... [14:55:49] (03Abandoned) 10Ladsgroup: Disable DPL on wikimania2016wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710482 (https://phabricator.wikimedia.org/T287916) (owner: 10Ladsgroup) [14:55:54] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:734426|Disable DPL on opt-in wikis where not in use (T287916)]] (duration: 00m 56s) [14:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:58] T287916: Disable DPL on wikis that aren't using it - https://phabricator.wikimedia.org/T287916 [14:57:10] (03PS5) 10Giuseppe Lavagetto: profile::mediawiki::php: support kubernetes in php-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) [14:57:12] (03PS1) 10Giuseppe Lavagetto: service::catalog: add apple-search [puppet] - 10https://gerrit.wikimedia.org/r/740598 (https://phabricator.wikimedia.org/T289224) [14:57:14] (03PS1) 10Giuseppe Lavagetto: apple-search: move to lvs setup [puppet] - 
10https://gerrit.wikimedia.org/r/740599 (https://phabricator.wikimedia.org/T289224) [14:57:16] (03PS1) 10Giuseppe Lavagetto: apple-search: enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/740600 (https://phabricator.wikimedia.org/T289224) [14:57:18] (03PS1) 10Giuseppe Lavagetto: apple-search: promote to production [puppet] - 10https://gerrit.wikimedia.org/r/740601 (https://phabricator.wikimedia.org/T289224) [14:58:04] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:58:26] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/740602 [14:58:31] !log jelto@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab-runner1001.wikimedia.org [14:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:00] (03PS2) 10Giuseppe Lavagetto: service::catalog: add apple-search [puppet] - 10https://gerrit.wikimedia.org/r/740598 (https://phabricator.wikimedia.org/T289224) [14:59:02] (03PS2) 10Giuseppe Lavagetto: apple-search: move to lvs setup [puppet] - 10https://gerrit.wikimedia.org/r/740599 (https://phabricator.wikimedia.org/T289224) [14:59:04] (03PS2) 10Giuseppe Lavagetto: apple-search: enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/740600 (https://phabricator.wikimedia.org/T289224) [14:59:06] (03PS2) 10Giuseppe Lavagetto: apple-search: promote to production [puppet] - 10https://gerrit.wikimedia.org/r/740601 (https://phabricator.wikimedia.org/T289224) [15:00:07] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32553/console" [puppet] - 10https://gerrit.wikimedia.org/r/740598 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto) [15:02:10] (03PS1) 10Jelto: 
site and install_server: add gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) [15:02:59] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/740605 [15:03:06] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:03:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:41] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one note inline." [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:06:23] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] service::catalog: add apple-search [puppet] - 10https://gerrit.wikimedia.org/r/740598 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto) [15:06:49] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/740582 (owner: 10Filippo Giunchedi) [15:07:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:44] (03CR) 10Jbond: [C: 03+1] Reimplement hook to no longer call update-ca-certificates [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740547 (https://phabricator.wikimedia.org/T296127) (owner: 10JMeybohm) [15:08:15] (03PS1) 10Majavah: librenms: replace librenms-readers with wmf/nda [puppet] - 10https://gerrit.wikimedia.org/r/740626 (https://phabricator.wikimedia.org/T295700) [15:08:58] (03PS3) 10Giuseppe Lavagetto: apple-search: move to lvs setup [puppet] - 10https://gerrit.wikimedia.org/r/740599 (https://phabricator.wikimedia.org/T289224) [15:09:08] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10Papaul) [15:11:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] apple-search: move to lvs setup [puppet] - 10https://gerrit.wikimedia.org/r/740599 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto) [15:11:16] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Reimplement hook to no longer call update-ca-certificates [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740547 (https://phabricator.wikimedia.org/T296127) (owner: 10JMeybohm) [15:11:21] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Bump debian/changelog [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/740548 (owner: 10JMeybohm) [15:12:50] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Jdforrester-WMF) p:05Triage→03Unbreak! Within the context of the Beta... 
[15:13:28] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10Ottomata) Approved, this will need kerberos access as well. https://wikitech.wikimedia.org/wiki/Analytics/Data_acce... [15:13:53] <_joe_> !log restarting pybal low-traffic in codfw, eqiad [15:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:14] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:16:01] !log imported wmf-certificates 0~20211122-1 to stretch-wikimedia,buster-wikimedia,bullseye-wikimedia [15:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:37] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Urbanecm) >>! In T296125#7520450, @Jdforrester-WMF wrote: > Within the co... [15:17:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2044-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [15:17:19] !log set kvm:machine_version=pc-i440fx-2.8 for Ganeti cluster in codfw T294119 [15:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:23] T294119: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 [15:17:34] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:18:46] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Jdforrester-WMF) [15:19:32] (03CR) 10AOkoth: [C: 03+2] sites: add new kubestage nodes [homer/public] - 10https://gerrit.wikimedia.org/r/739879 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth) [15:20:09] (03Merged) 10jenkins-bot: sites: add new kubestage nodes [homer/public] - 10https://gerrit.wikimedia.org/r/739879 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth) [15:21:16] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 66 connections established with conf2004.codfw.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [15:22:10] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.68:4013]) https://wikitech.wikimedia.org/wiki/PyBal [15:22:17] <_joe_> that's me, expected [15:22:20] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 75 connections established with conf1004.eqiad.wmnet:4001 (min=76) https://wikitech.wikimedia.org/wiki/PyBal [15:22:41] (03PS1) 10Vgutierrez: role:cache: Provide a text_haproxy role [puppet] - 10https://gerrit.wikimedia.org/r/740628 (https://phabricator.wikimedia.org/T290005) [15:23:11] (03PS1) 10Filippo Giunchedi: Revert "Include base::puppet in profile::puppetmaster::pontoon" [puppet] - 10https://gerrit.wikimedia.org/r/740630 [15:24:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [15:26:42] RECOVERY - PyBal connections to etcd on 
lvs2009 is OK: OK: 67 connections established with conf2004.codfw.wmnet:4001 (min=67) https://wikitech.wikimedia.org/wiki/PyBal [15:27:16] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "Include base::puppet in profile::puppetmaster::pontoon" [puppet] - 10https://gerrit.wikimedia.org/r/740630 (owner: 10Filippo Giunchedi) [15:27:30] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:27:38] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 76 connections established with conf1004.eqiad.wmnet:4001 (min=76) https://wikitech.wikimedia.org/wiki/PyBal [15:27:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus2006.codfw.wmnet with OS bullseye [15:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host prometheus2006.codfw.... 
[15:28:52] !log revoking DROP for wikiadmin from db1100 (T249683) [15:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:55] T249683: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 [15:29:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [15:29:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10Papaul) [15:30:02] (03PS1) 10ArielGlenn: fix up arg processing for Enterprise downloader script [puppet] - 10https://gerrit.wikimedia.org/r/740632 (https://phabricator.wikimedia.org/T273585) [15:32:10] (03PS1) 10Filippo Giunchedi: install_server: tweak reserved space for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/740633 (https://phabricator.wikimedia.org/T294302) [15:32:24] (03PS2) 10ArielGlenn: fix up arg processing for Enterprise HTML dumps downloader script [puppet] - 10https://gerrit.wikimedia.org/r/740632 (https://phabricator.wikimedia.org/T273585) [15:32:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10Papaul) 05Open→03Resolved This is complete on our end. 
[15:34:48] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:37:46] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [15:38:54] (03CR) 10ArielGlenn: [C: 03+2] fix up arg processing for Enterprise HTML dumps downloader script [puppet] - 10https://gerrit.wikimedia.org/r/740632 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [15:39:07] (03PS8) 10Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [15:39:22] (03CR) 10JMeybohm: profile::base::certificates: deploy wmf-certificates only in prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [15:39:35] (03CR) 10Jbond: [C: 03+1] librenms: replace librenms-readers with wmf/nda [puppet] - 10https://gerrit.wikimedia.org/r/740626 (https://phabricator.wikimedia.org/T295700) (owner: 10Majavah) [15:41:00] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Majavah) >>! In T296125#7520534, @Stashbot wrote: > {nav icon=file, name=... [15:42:17] (03CR) 10Ayounsi: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/740626 (https://phabricator.wikimedia.org/T295700) (owner: 10Majavah) [15:42:23] (03CR) 10Muehlenhoff: New cookbook to reboot a VM on the Ganeti level (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [15:42:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] apple-search: enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/740600 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto) [15:42:36] (03PS3) 10Giuseppe Lavagetto: apple-search: enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/740600 (https://phabricator.wikimedia.org/T289224) [15:51:31] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/740626 (https://phabricator.wikimedia.org/T295700) (owner: 10Majavah) [15:55:11] (03CR) 10Ema: "A few comments inline, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/740545 (owner: 10Jbond) [15:55:40] (03CR) 10Jbond: [C: 03+2] librenms: replace librenms-readers with wmf/nda [puppet] - 10https://gerrit.wikimedia.org/r/740626 (https://phabricator.wikimedia.org/T295700) (owner: 10Majavah) [15:55:42] !log Telia DDoS auto-mitigation enabled on all circuits - T288926 [15:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:46] cdanis: ^ [15:55:57] XioNoX: <3 [15:56:06] I am on clinic duty this week, requesting to have the channel's topic to be updated with my name :) [15:56:34] (03PS3) 10Giuseppe Lavagetto: apple-search: promote to production [puppet] - 10https://gerrit.wikimedia.org/r/740601 (https://phabricator.wikimedia.org/T289224) [15:56:43] we should probably expand the ops list of this channel a bit, https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/irc/ircservserv-config/+/refs/heads/master/channels/wikimedia-operations.toml [15:56:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:11] mmandere: does this work? [15:58:23] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: fix puppetdb role name [puppet] - 10https://gerrit.wikimedia.org/r/740582 (owner: 10Filippo Giunchedi) [15:58:43] majavah: I'd just give +o to any SRE/deployer on request, tbh. [15:59:02] urbanecm: Yes that works, thank you :) [15:59:12] any time! [16:01:16] 10SRE, 10Platform Engineering, 10Traffic, 10Patch-For-Review, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10Umherirrender) >>!... 
[16:04:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] apple-search: promote to production [puppet] - 10https://gerrit.wikimedia.org/r/740601 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto) [16:10:39] 10SRE, 10Community-Tech, 10LDAP-Access-Requests: Grant Access to Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10MMandere) Hi @samwilson: Can you please clarify as per https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Dashboards_in_Superset_/_Hive_interfaces_(like_Hue)_that_do_a... [16:13:53] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Majavah) [16:14:37] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Majavah) 05Open→03Resolved a:03Majavah [16:14:41] (03CR) 10BBlack: [C: 03+1] Add ownership annotations for additional Traffic services [puppet] - 10https://gerrit.wikimedia.org/r/738262 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [16:15:12] !log installing postgresql-13 security updates on bullseye [16:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:30] !log Password reset for Miraki@arbcom_dewiki per private request [16:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:45] (03PS1) 10Muehlenhoff: Add library hint for postgresql-13 [puppet] - 10https://gerrit.wikimedia.org/r/740636 [16:20:17] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for postgresql-13 [puppet] - 10https://gerrit.wikimedia.org/r/740636 (owner: 10Muehlenhoff) [16:23:11] (03PS2) 10Herron: thanos: add recording rules for varnish SLO [puppet] - 10https://gerrit.wikimedia.org/r/740209 
(https://phabricator.wikimedia.org/T289615) [16:26:00] (03CR) 10Jbond: "LGTM most comments are optional/nits but please do switch to stdlib::ensure" [puppet] - 10https://gerrit.wikimedia.org/r/740389 (https://phabricator.wikimedia.org/T296127) (owner: 10Elukey) [16:29:33] (03CR) 10Joal: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/740233 (owner: 10Mforns) [16:30:05] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211122T1630). [16:31:19] (03PS2) 10Herron: thanos: add experimental varnish multiwindow recording rules [puppet] - 10https://gerrit.wikimedia.org/r/740211 [16:34:47] (03PS3) 10Herron: thanos: add experimental varnish multiwindow recording rules [puppet] - 10https://gerrit.wikimedia.org/r/740211 [16:39:16] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705 [16:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:21] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [16:40:10] (03CR) 10Herron: "initial sketch for multiwindow recording rules to be used by burn rate alerting rules." 
[puppet] - 10https://gerrit.wikimedia.org/r/740211 (owner: 10Herron) [16:41:42] (03PS1) 10Vgutierrez: site: Reimage cp4032 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/740638 (https://phabricator.wikimedia.org/T290005) [16:44:46] !log T295705 Upgrading `relforge` elasticsearch packages: `ryankemper@cumin1001:~$ sudo cumin -b 2 'relforge*' 'DEBIAN_FRONTEND=noninteractive sudo apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install elasticsearch-oss wmf-elasticsearch-search-plugins'` [16:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:51] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [16:46:09] (03CR) 10Vgutierrez: [C: 03+2] role:cache: Provide a text_haproxy role [puppet] - 10https://gerrit.wikimedia.org/r/740628 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:49:19] !log [Elastic] T295705 Downtimed relforge* for 2 hours in order to perform a manual rolling restart of the two hosts `relforge1003` and `relforge1004` [16:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:37] !log depool cp4032 to be reimaged as cache::text_haproxy - T290005 [16:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:40] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [16:51:15] !log fleet wide updated wmf-certificates to 0~20211122-1 [16:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:31] (03PS1) 10Giuseppe Lavagetto: Add discovery for apple-search [dns] - 10https://gerrit.wikimedia.org/r/740640 [16:52:47] !log [Elastic] T295705 Restarting first relforge host: `ryankemper@relforge1004:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service logstash.service` [16:52:50] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:51] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [16:53:19] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp4032 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/740638 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:55:20] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4032.ulsfo.wmnet with OS buster [16:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:31] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4032.ulsfo.wmnet with OS buster [16:55:50] PROBLEM - Host kubestage1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:53] !log [Elastic] T295705 Restarting second and final relforge host: `ryankemper@relforge1003:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service logstash.service` [16:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:00] RECOVERY - Host kubestage1004 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:58:14] !log [Elastic] T295705 Rolling restart w/ plugin upgrade of `relforge` is complete [16:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:18] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [17:00:34] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 [17:00:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:17] !log T295705 Beginning rolling restart w/ plugin upgrade of `cloudelastic`: `ryankemper@cumin1001:~$ sudo cookbook sre.elasticsearch.rolling-operation cloudelastic "cloudelastic plugin upgrade + restart" --upgrade --nodes-per-run 3 --start-datetime 2021-11-22T16:59:38 --task-id T295705` on tmux `rolling_restarts_cloudelastic` [17:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:49] 10SRE, 10Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (10Legoktm) Sorry about delays - I was going to create the mailing lists just now and realized they don't fit our guidelines for names (https://meta.wikimedia.org/wiki/Mailing_lis... [17:06:20] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10MMandere) @Daimona You have been added to the `wmf` group. Please let us know if there are any questions, thank you! [17:06:47] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10MMandere) 05Open→03Resolved [17:08:47] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10RhinosF1) Doesn't https://github.com/wikimedia/puppet/blob/24cc1258080076d140eaf705b26f0f8ac63563c0/modules/admin/data/data.yaml#L4416 need updating with WMF email to match LDAP? 
[17:08:49] (03PS1) 10Majavah: hieradata: Route search.wm.o to apple-search [puppet] - 10https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) [17:11:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add discovery for apple-search [dns] - 10https://gerrit.wikimedia.org/r/740640 (owner: 10Giuseppe Lavagetto) [17:23:44] 10SRE, 10Platform Engineering, 10Traffic, 10Patch-For-Review, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10MdsShakil) I tryin... [17:24:16] (03CR) 10Volans: "Replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [17:29:31] (03CR) 10Dzahn: "I think we don't actually have to put these in wikimedia.org in the public VLAN. We could do this in eqiad.wmnet. That means having to rec" [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:32:28] !log restart both elasticsearch instances on elastic2044, reporting `connection refused` (after a brief period of `no route to host`) to masters even though the connection works outside elastic [17:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:26] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4839 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:36:47] o.O [17:37:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2044-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [17:37:12] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like 
on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [17:37:26] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=54&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-30m&to=now looks bad [17:38:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11] - https://phabricator.wikimedia.org/T294297 (10Papaul) [17:41:20] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3387 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:41:37] I see a spike of cirrussearch-too-busy in logstash [17:41:47] looking [17:42:05] which matches the timing of the php-fpm alert above [17:42:22] Note there's an ongoing rolling restart of `codfw`, although usually it doesn't trigger latency warnings [17:43:02] they are all on wikidata it seems [17:43:13] (03PS1) 10Ayounsi: Revert "Depool codfw due to IPV6 connectivity issues" [dns] - 10https://gerrit.wikimedia.org/r/740612 [17:43:18] (03PS2) 10Ayounsi: Revert "Depool codfw due to IPV6 connectivity issues" [dns] - 10https://gerrit.wikimedia.org/r/740612 [17:43:36] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:58] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:46:55] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4032.ulsfo.wmnet 
with OS buster [17:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:06] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4032.ulsfo.wmnet with OS buster c... [17:47:32] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool codfw due to IPV6 connectivity issues" [dns] - 10https://gerrit.wikimedia.org/r/740612 (owner: 10Ayounsi) [17:48:13] !log repool codfw [17:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:19] Based off of https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?orgId=1 it looks like codfw elasticsearch is overloaded. Pool counter's rejecting about ~half of current requests [17:50:37] Confirmed that I see `An error has occurred while searching: Search is currently too busy. Please try again later.` when trying the search box. 
Going to pause the rolling restart and let the cluster settle [17:50:53] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705 [17:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:57] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [17:52:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: (2) Elasticsearch instance elastic2044-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [17:56:42] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) 05Resolved→03Open [17:57:00] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) Yes, that. and also needs removal from "nda". [17:58:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) So we had some unexpected consequences over the weekend following this change. Example mail from ISP below: ` > Cc'ing Wikimedia NOC. > > We have... [17:59:40] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) 05Open→03Resolved nda removal was already done and updating email would reveal realname, something that was avoided in this case before. closing again. [18:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211122T1800). 
[18:01:43] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10RhinosF1) Email can be found in LDAP and name is quoted in the task above. @Daimona: does your thoughts on real name still apply? [18:02:24] ryankemper: the API cluster looks good again, was the rolling restart just too aggressive? [18:03:54] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:04:04] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 [18:04:06] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [18:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:24] (03CR) 10Dzahn: [C: 03+2] planet: Add TheresNoTime's blog to en [puppet] - 10https://gerrit.wikimedia.org/r/740376 (owner: 10Samtar) [18:04:48] legoktm: I think that's likely. Our clusters are a bit underprovisioned relative to where they should be (this will improve in Q3 when we scale up host count by ~38%) [18:05:04] cause I don't see an unusual increase in requests or anything, just the spike in rejections itself [18:07:34] mutante: ah, sorry about that topic name for 740376, set via the branch name? [18:09:45] tn: ah, you must be Sam:) yes, the local branch name turns into the "topic" in Gerrit web UI. 
dont worry at all, few people do that, it's just my thing that I like to use it to group patches. gets you a single link to watch all the related ones.. like planet-feeds or access-requests [18:10:23] ah okay, makes sense, thank you :) [18:10:26] https://gerrit.wikimedia.org/r/q/topic:%22planet-feeds%22+(status:open%20OR%20status:merged) [18:10:29] like this [18:10:53] (03PS1) 10Vgutierrez: prometheus::ops: Add cache::text_haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/740652 (https://phabricator.wikimedia.org/T290005) [18:11:30] tn: myself I don't care about the local branch name either, I just click "edit" in the UI after uploading [18:11:49] or that little "x", not "edit", actually [18:12:09] I'll bear that in mind, most other patches I've done have been directly related to a phab task so its often been named after that ^^ [18:13:11] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32554/console" [puppet] - 10https://gerrit.wikimedia.org/r/740652 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [18:13:52] tn: yep, depends on the context. makes sense if it's a specific bug. 
I just do that where it's an ongoing thing, requests for additions [18:13:53] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] prometheus::ops: Add cache::text_haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/740652 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [18:14:13] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:46] (03CR) 10Vgutierrez: [C: 03+1] acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [18:35:18] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10ayounsi) 05Open→03Resolved Codfw repooled, everything is back to normal. [18:46:02] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:58:08] 10SRE, 10ops-codfw, 10DC-Ops, 10Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11] - https://phabricator.wikimedia.org/T294297 (10Papaul) [18:59:34] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10Papaul) [19:00:05] RoanKattouw and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211122T1900). [19:00:05] nn1l2, dontpanic, cjming, mbch331, and inductiveload: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:26] Hello [19:00:37] Hey, a lot of backport clients signed up! [19:01:10] I can deploy in ~20 if no one else's around [19:01:29] !log pool cp4032 (text) using HAProxy as TLS terminator - T290005 [19:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:34] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [19:05:11] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:52] hey [19:09:19] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [19:09:44] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [19:11:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:26] * urbanecm waves to everyone [19:11:29] (03PS1) 10Andrew Bogott: nova_fixed_multi: improved logging for unexpected exceptions [puppet] - 10https://gerrit.wikimedia.org/r/740659 (https://phabricator.wikimedia.org/T296144) [19:11:44] hey nn1l2, dontpanic, mbch331 and inductiveload: are you still around for the window? [19:11:50] Yes [19:11:51] yes [19:11:52] yup [19:11:56] excellent! 
[19:12:04] I'll start with nn1l2's patches then [19:12:09] thnks [19:12:16] \o [19:12:27] (03PS2) 10Urbanecm: Enable mapframe on the Indonesian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738547 (https://phabricator.wikimedia.org/T295571) (owner: 104nn1l2) [19:12:31] (03CR) 10Urbanecm: [C: 03+2] Enable mapframe on the Indonesian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738547 (https://phabricator.wikimedia.org/T295571) (owner: 104nn1l2) [19:13:12] nn1l2: honestly i think we should just enable sandboxlink everywhere. Can be a nice topic for GRFC I guess [19:13:19] (03Merged) 10jenkins-bot: Enable mapframe on the Indonesian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738547 (https://phabricator.wikimedia.org/T295571) (owner: 104nn1l2) [19:13:31] (ofc i'll deploy this one, just saying) [19:13:43] Thanks [19:13:46] nn1l2: the mapframe patch is at mwdebug1001, please test [19:13:52] (03PS2) 10Urbanecm: Enable SandboxLink on lawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740186 (https://phabricator.wikimedia.org/T296073) (owner: 104nn1l2) [19:13:57] (03CR) 10Urbanecm: [C: 03+2] Enable SandboxLink on lawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740186 (https://phabricator.wikimedia.org/T296073) (owner: 104nn1l2) [19:14:23] 10SRE, 10ops-codfw, 10serviceops: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Papaul) 05Open→03Resolved Decommission compete [19:14:31] (03CR) 10Andrew Bogott: [C: 03+2] nova_fixed_multi: improved logging for unexpected exceptions [puppet] - 10https://gerrit.wikimedia.org/r/740659 (https://phabricator.wikimedia.org/T296144) (owner: 10Andrew Bogott) [19:14:37] LGTM https://id.wikipedia.org/wiki/Pembicaraan_Pengguna:4nn1l2/Sand [19:14:38] dontpanic: do you know how to test your patch please? 
[19:14:43] (not ready yet, asking for when it is) [19:14:48] nn1l2: thanks, syncing [19:14:59] (03Merged) 10jenkins-bot: Enable SandboxLink on lawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740186 (https://phabricator.wikimedia.org/T296073) (owner: 104nn1l2) [19:15:01] I think so, isn't it just finding a page with a number and seeing if it's translated? [19:15:09] 10SRE, 10ops-codfw, 10decommission-hardware, 10serviceops: decommission thumbor200[12].codfw.wmnet - https://phabricator.wikimedia.org/T273141 (10Papaul) [19:15:25] dontpanic: i'm afraid it won't be that easy, by reading https://www.mediawiki.org/wiki/Manual:$wgTranslateNumerals [19:15:30] but we can try and see :) [19:15:33] 10SRE, 10ops-codfw, 10decommission-hardware, 10serviceops: decommission thumbor200[12].codfw.wmnet - https://phabricator.wikimedia.org/T273141 (10Papaul) 05Open→03Resolved complete [19:15:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:01] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1c082bec4c74c156b26af4349488835902c5bacd: Enable mapframe on the Indonesian Wikipedia (T295571) (duration: 00m 56s) [19:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:09] nn1l2: your patch is live [19:16:10] T295571: Enable on Indonesian-language Wikipedia - https://phabricator.wikimedia.org/T295571 [19:16:25] nn1l2: your second patch is now available at mwdebug1001, please test [19:16:56] LGTM [19:17:20] (03PS2) 10Urbanecm: Use the WikiEditor ready hook instead of using() the lib [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740563 (https://phabricator.wikimedia.org/T296033) (owner: 10Inductiveload) [19:17:40] nn1l2: thanks, syncing [19:18:26] (03CR) 10Urbanecm: [C: 03+2] Use the WikiEditor ready hook instead of using() the lib [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740563 (https://phabricator.wikimedia.org/T296033) (owner: 10Inductiveload) [19:18:31] inductiveload: +2'ed your backport, will let you know once ready [19:18:55] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4aa8d5bf465bfc3fee2ec547718af0c779f88ef4: Enable SandboxLink on lawiki (T296073) (duration: 00m 56s) [19:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:00] T296073: Enable SandboxLink on lawiki - https://phabricator.wikimedia.org/T296073 [19:19:06] mbch331: hello, do you know how to test your patch, once available for testing? [19:19:14] nn1l2: both your patches are now live. anything else i can do for you? [19:19:31] urbanecm, Thanks. All done. [19:19:37] great! 
[19:19:40] (03PS2) 10Urbanecm: kswiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740179 (https://phabricator.wikimedia.org/T296055) (owner: 10Tks4Fish) [19:19:45] (03CR) 10Urbanecm: [C: 03+2] kswiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740179 (https://phabricator.wikimedia.org/T296055) (owner: 10Tks4Fish) [19:20:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:12] @urbanecm: Don't think it can be tested [19:20:31] mbch331: in that case, what do you expect it to do please? [19:20:41] (03Merged) 10jenkins-bot: kswiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740179 (https://phabricator.wikimedia.org/T296055) (owner: 10Tks4Fish) [19:21:08] dontpanic: your patch is now available at mwdebug1001, please test (and confirm you're seeing a change somewhere) [19:22:29] looking [19:22:46] I do see some changes, urbanecm [19:22:55] dontpanic: which ones please? :) [19:22:58] at Special:RecentChanges, the time changes [19:23:05] it's the only thing I can see though [19:23:09] sounds like about what should happen [19:23:10] It should help somehow to sort the languages in the language drop down of create new item of Wikidata. 
I only know that language codes that can be used for Wikidata and are not mediawiki languages need to be added there [19:23:11] syncing it then [19:23:15] and the dates too [19:23:22] cool, ty :) [19:24:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b6b05e30b3c9b4007fd31ab0698507d7a48d1caf: kswiki: set wgTranslateNumerals to false (T296055) (duration: 00m 55s) [19:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:35] T296055: Change numerals from Devanagari and Urdu to Arabic for ks.wiki - https://phabricator.wikimedia.org/T296055 [19:24:40] dontpanic: should be live. anything else from you? [19:24:49] nope, that's it :) [19:24:50] thanks! [19:24:53] np [19:25:07] mbch331: and why do you think that "helping" behavior cannot be tested? [19:25:34] Because I have no idea how it exactly works [19:26:23] In that case, I'd prefer that patch being rescheduled for a different video (the EU is likely the best, as Lucas tends to be around) [19:26:43] I also don't have any idea about how it works, and I don't feel comfortable deploying patches when no one knows how they work :) [19:27:11] Ok. I'll see if he's comfortable deploying it without me present [19:28:00] he's much more knowledgable about WD than I am :) [19:28:25] inductiveload: fyi, currently waiting on CI for your patch [19:30:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:44] (03Merged) 10jenkins-bot: Use the WikiEditor ready hook instead of using() the lib [extensions/ProofreadPage] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740563 (https://phabricator.wikimedia.org/T296033) (owner: 10Inductiveload) [19:36:56] finally [19:37:33] inductiveload: your patch is at mwdebug1001, can you test please? [19:40:08] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:42] inductiveload: ping? [19:42:07] Sorry [19:42:11] I'm here [19:42:19] great [19:42:26] inductiveload: can you test your patch at mwdebug1001 please? [19:42:34] Yes [19:42:36] It's working [19:42:39] great [19:42:41] syncing [19:42:46] 10SRE, 10Patch-For-Review: Decommission svn.wikimedia.org server (import SVN into Phabricator) - https://phabricator.wikimedia.org/T86655 (10Xqt) [19:42:46] Thank you! [19:42:50] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:43:37] np [19:44:24] (03PS1) 10Nray: Enable reading depth instrumentation at low sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740667 (https://phabricator.wikimedia.org/T294777) [19:44:29] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/ProofreadPage/: 10b8440069ac71434274462c545c6b2b2c9182d9: Use the WikiEditor ready hook instead of using() the lib (T296033) (duration: 00m 56s) [19:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:34] T296033: Disabling toolbar no longer works in Page namespace - https://phabricator.wikimedia.org/T296033 [19:44:36] inductiveload: should be live [19:44:38] anything else? 
[19:44:56] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:45:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:09] (03PS2) 10Nray: Enable reading depth instrumentation at low sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740667 (https://phabricator.wikimedia.org/T294777) [19:46:20] No that's it for today, thank you [19:46:23] great [19:46:30] 'log Evening B&C window completed [19:46:53] !log Evening B&C window completed [19:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:22] (03PS3) 10Nray: Enable reading depth instrumentation at low sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740667 (https://phabricator.wikimedia.org/T294777) [19:49:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:25] Going to take another swing at the codfw elasticsearch rolling restart. 
Will be monitoring to see if we get a spike in pool counter rejections [19:49:55] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705 [19:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:59] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [19:50:35] (03PS4) 10Ryan Kemper: query_service: Generalize prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/737484 (https://phabricator.wikimedia.org/T280008) (owner: 10Ebernhardson) [19:54:06] (03CR) 10Ryan Kemper: [C: 03+2] query_service: Generalize prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/737484 (https://phabricator.wikimedia.org/T280008) (owner: 10Ebernhardson) [19:56:30] (03PS1) 10Dzahn: gitlab: create role for prod gitlab-runners, adjust cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/740670 (https://phabricator.wikimedia.org/T295481) [19:57:48] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:59:06] (03CR) 10Dzahn: "Creates a puppet role for gitlab runners (in prod). Like the existing gitlab server role but without adding it to Bacula backups (?) and i" [puppet] - 10https://gerrit.wikimedia.org/r/740670 (https://phabricator.wikimedia.org/T295481) (owner: 10Dzahn) [19:59:51] (03CR) 10Dzahn: "I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/740670/ to create a role first and to answer your question how to map it." 
[puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [19:59:58] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:00:05] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2739 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [20:00:33] ryankemper: ^^ [20:00:35] (03CR) 10Dzahn: site and install_server: add gitlab-runner1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [20:00:48] * jbond here [20:00:50] legoktm: thanks [20:01:23] * volans around if needed [20:01:33] walking home from lunch, unavailable another 10-15m [20:01:35] Rolling update just needs to be halted and things should settle. DO I need to ack the page on victorops? 
[20:01:47] Do* [20:01:49] I can do that [20:01:57] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705 [20:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:01] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [20:02:08] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.7097 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:02:13] (03CR) 10Dzahn: "I would suggest to first merge the new role change which also handles the cumin part.Then create VM one more time in private subnet and th" [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [20:02:18] thanks ryankemper [20:02:28] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [20:02:38] ryankemper: what's the relation with the page? [20:02:51] <_joe_> so the problem is slow elasticsearch causing starvation of php workers? [20:03:00] yes [20:03:00] ^ yes, exactly that [20:03:04] <_joe_> do we already know that? 
[20:03:19] The elasticsearch cluster can't keep up, which is tripping the pool counter causing rejections which are downstream impacting the php worker availability [20:03:21] <_joe_> we might want to get more aggressive with circuit breaking then [20:03:27] https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=2&orgId=1&from=now-30m&to=now pool counter rejections [20:04:00] <_joe_> but clearly the poolcounter limits are still too high with the current latency [20:05:01] Hmm [20:05:11] <_joe_> so we need to find a sweet spot between timeouts on the appservers and poolcounter concurrency limits [20:05:41] What I was envisioning is that the php workers are trying to do something that involves talking to elasticsearch, causing jobs to take longer / retry [20:05:42] <_joe_> actually, that doesn't make much sense; poolcounter *is* limiting the concurrent number of queries to ES [20:05:56] Whereas if the cluster could handle more they would be completing promptly [20:06:01] <_joe_> yes ofc [20:06:03] Right, exactly that [20:06:28] <_joe_> poolcounter works here by limiting the maximum number of queries to ES that can be happening at any time [20:06:52] <_joe_> but I guess that number is relatively high? [20:07:15] <_joe_> I've never looked at cirrussearch's code [20:07:41] in slow log it's hanging at stuff like: [20:07:42] [0x00007fcb6f41e6f0] curl_exec() /srv/mediawiki/php-1.38.0-wmf.9/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php:166 [20:08:03] things should settle by themselves or do we need to take action to make it recover quicker? 
[20:08:08] <_joe_> the situation is under control however [20:08:19] <_joe_> volans: they should settle by themselves [20:08:24] volans: Things will settle by themselves, rolling restart has been aborted [20:08:28] <_joe_> but looking at impact, it seems limited [20:08:49] <_joe_> one action we could take is lowering radically the timeout to ES in envoy to something like 2 seconds [20:09:18] <_joe_> for the duration of maintenance, a lot of queries to the cluster under maintenance will fail [20:09:34] _joe_: The intended result being that the php workers wouldn't be sitting twiddling their thumbs basically? [20:09:46] <_joe_> ryankemper: yeah [20:09:55] <_joe_> so either that, or lowering the concurrency in poolcounter [20:10:00] <_joe_> which is also a possibility [20:10:13] <_joe_> and maybe the gentler solution [20:10:21] Do we know what the retry behavior is? That sounds reasonable but I'd want to be sure that it doesn't result in a retry explosion [20:10:46] _joe_: I think lowering concurrency makes a lot of sense because even with the pool counter kicking in we're still seeing this downstream behavior [20:10:53] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2825 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [20:11:06] Meaning that the es cluster can't handle work properly at the current concurrency level [20:12:01] <_joe_> it's not recovering it seems [20:12:34] <_joe_> legoktm: if it persists, I would suggest lowering the concurrency for poolcounter for the ES queries [20:13:00] I was just looking for that [20:13:01] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/b6b05e30b3c9b4007fd31ab0698507d7a48d1caf/wmf-config/PoolCounterSettings.php#18 [20:13:46] do we know which category these fall under? 
or just cut them all? [20:14:06] It should recover, usually when ES gets overwhelmed it takes a bit of time for it to get into a happy state again [20:14:19] (We should still look into lowering the concurrency regardless) [20:14:34] fyi i just got an user report that searching at WD is quite slow, and autocomplete for adding statements almost doesn't work (due to search taking long time) [20:15:14] urbanecm: ack, they should hopefully see things improve in 10 minutes or so [20:15:21] sounds good, thanks [20:15:26] Yeah, it's stupid slow at the moment [20:15:54] grafana seems to be showing mild recovery (albeit probably too soon to say that for sure) [20:16:32] Generally when it recovers it will go from blocking 50% of all requests to blocking 0% in a matter of 60 seconds [20:16:42] Fingers crossed we should be just a few minutes away from seeing that [20:17:24] For example here's the recovery of when the same thing happened a few hours ago: https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=2&orgId=1&from=1637603531439&to=1637603829372 [20:18:02] (03PS1) 10Legoktm: Lower CirrusSearch maxqueues to be closer to number of workers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740674 [20:19:41] ryankemper: ^ should make it fail faster without affecting the number of queries that go through [20:20:49] legoktm: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/740674/1/wmf-config/PoolCounterSettings.php#66 looks reasonable...I/we will need to decide if we want to apply this now or wait for the cluster to stabilize first [20:20:57] I think applying it now is probably the right call [20:21:01] (03CR) 10Legoktm: [C: 03+2] Lower CirrusSearch maxqueues to be closer to number of workers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740674 (owner: 10Legoktm) [20:21:20] legoktm: what's the rollout process? Just a `scap sync-file` or similar? 
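The remark that the maxqueues change "should make it fail faster without affecting the number of queries that go through" can be illustrated with a toy model. This is only a sketch: `PoolLimits` and `admit` are hypothetical names, the numbers are made up rather than taken from PoolCounterSettings.php, and `maxqueue` is assumed here to cap running plus waiting requests combined.

```python
from dataclasses import dataclass


@dataclass
class PoolLimits:
    """Hypothetical stand-in for one pool; field names are illustrative."""
    workers: int   # requests allowed to run against the backend at once
    maxqueue: int  # assumption: cap on running + waiting requests combined


def admit(limits, arrivals):
    """Classify a burst of simultaneous arrivals against the pool limits."""
    running = min(arrivals, limits.workers)
    queue_slots = max(limits.maxqueue - limits.workers, 0)
    waiting = min(arrivals - running, queue_slots)
    rejected = arrivals - running - waiting
    return {"running": running, "waiting": waiting, "rejected": rejected}


# Made-up numbers, chosen only to show the shape of the effect.
old = admit(PoolLimits(workers=100, maxqueue=400), arrivals=500)
new = admit(PoolLimits(workers=100, maxqueue=120), arrivals=500)
print("old limits:", old)
print("new limits:", new)
```

In this model the number of running queries is the same under both settings. What changes is how many callers sit in the queue holding a PHP worker versus being rejected straight away.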
[20:21:21] me too, once the cluster recovers we can revert it and then figure out proper numbers [20:21:26] yeah, I'll take care of it [20:22:12] (03Merged) 10jenkins-bot: Lower CirrusSearch maxqueues to be closer to number of workers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740674 (owner: 10Legoktm) [20:22:21] search on mobile app for me just started to work again fwiw [20:22:43] <_joe_> it got intimidated by the merge [20:22:47] not really, just intermittent [20:22:50] xD [20:23:01] volans: I was gonna say, i'm not seeing recovery on the graph yet so you probably just won the coinflip [20:23:09] It's about 50% chance of a request making it through (all else equal) [20:23:22] syncing [20:23:27] legoktm: thanks [20:23:53] !log legoktm@deploy1002 Synchronized wmf-config/PoolCounterSettings.php: Lower CirrusSearch maxqueues to be closer to number of workers (duration: 00m 56s) [20:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:35] Ha, instant plunge on the idle vs active graph [20:24:37] graph starting to look better [20:24:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:11] Looking way better, only 10% of reqs blocked at last update [20:25:33] 10SRE, 10Platform Engineering, 10Traffic, 10Patch-For-Review, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10Umherirrender) >>!... 
[20:25:54] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01613 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:26:01] did ES recover at the same time then? the change should have increased the # of PC rejections [20:26:01] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6881 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [20:27:03] I think it probably did...or possibly it helped because the workers were causing some type of thrash, but I lean more towards the former explanation [20:27:15] * volans going back off then as we're out of the wookds [20:27:17] *woods [20:28:03] the deploy also would've restarted some PHP-FPM instances which also could've helped [20:28:21] Actually that probably would have helped quite a bit [20:28:47] I was leaning against the thrash explanation because the raw number of requests doesn't seem to increase during these incidents, but actually just having those synchronous api calls hanging is definitely gonna impact ES' ability to get work done [20:28:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:58] ccccccvkvhgbjvgvudbftkkevidkjdbrvjuldjvebtug [20:30:12] arnoldokoth: couldn't have said it better myself [20:30:24] :P [20:30:28] Sorry. :'( [20:30:30] Alright so incident's resolved. 
We'll want to take a harder look at the pool counter tuning, but I suspect that legoktm's approach of having the limit be slightly more than the number of workers is probably about where we want it anyway. Need to do some thinking though [20:30:42] do you still have more nodes that need rebooting? [20:31:02] s/reboot/restart, but yes [20:32:17] The only knob I have to turn is the # of hosts being restarted. We've been doing batches of 3 concurrently [20:32:28] would spacing it out help? [20:32:39] With this latest restart as an example the very first round of restarts plunged the cluster so there's not any fiddling with sleeps etc that can be done [20:32:40] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [20:32:53] I was about to ask about that -- is the real issue here just that the cluster can't withstand the loss of three nodes at once? 
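To make the "number of hosts restarted at once" knob concrete, here is a minimal sketch of splitting a cluster into restart batches. The host names are hypothetical, `restart_batches` is not the cookbook's actual code, and treating 36 hosts (a per-DC figure quoted elsewhere in this log) as the size of this cluster is an assumption.

```python
def restart_batches(hosts, batch_size):
    """Split hosts into consecutive batches, restarted one batch at a time."""
    return [hosts[i:i + batch_size] for i in range(0, len(hosts), batch_size)]


# Hypothetical host names; the count of 36 is an assumption for illustration.
hosts = [f"elastic-example{n:02d}" for n in range(1, 37)]

for size in (3, 1):
    batches = restart_batches(hosts, size)
    share_down = size / len(hosts)
    print(f"batch size {size}: {len(batches)} rounds, "
          f"{share_down:.1%} of hosts down per round")
```

Under these assumptions a batch size of 3 needs 12 rounds and a batch size of 1 needs 36. The share of hosts down per round is only a rough proxy for cluster impact, since shard reshuffling is not captured here.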
[20:33:00] But yes the # of hosts being restarted concurrently could be slowed down [20:33:12] well, "real issue" :) the trigger, maybe [20:33:28] rzl: ha, was about to wax philosophical on proximate vs root cause but you beat me to it :P [20:33:51] Yes, the proximate cause / problem is that we can't handle 3 hosts being restarted concurrently, but we should be able to handle that [20:34:11] haha I avoided "root cause" on purpose, let's definitely not get into that [20:34:16] Hopefully with the new tuning, future pool counter rejection spikes should not lead to the downstream workers getting starved [20:34:17] nod, makes sense [20:34:52] ie hopefully we can keep it isolated to just the cirrus cluster itself and not be paging on the broader worker availability [20:35:03] I guess two separate issues though, right -- I agree, the poolcounter change should avoid turning an elasticsearch capacity issue into an appserver capacity issue [20:35:09] but don't we still have an elasticsearch capacity issue? [20:35:24] or is it intended that we'll reject some traffic during the restart [20:36:04] rzl: yes, we absolutely have an elasticsearch capacity issue. the situation will improve in Q3 when we're scaling up each dc from 36->50 total hosts [20:36:06] naively I would have figured the point of the rolling restart is you're always running enough machines to serve the actual load [20:36:10] ohhh okay [20:36:13] yes [20:36:14] sorry, I'm way behind then :) thanks [20:36:52] rough math, we on average have ~3k idle PHP-FPM workers, with the old limits, we could have 1,910 workers either waiting for ES or in the PC queue waiting, with new limits its capped at 1,035 workers [20:36:53] rzl: no worries, good questions :) and FWIW we had the idea that we were a little underprovisioned for awhile now, but it wasn't really until today that it became apparent exactly how much [20:37:06] nod [20:37:54] can we get away with doing batches of 1 instead of 3? 
or do we get into operational problems from how much longer it would take? [20:39:03] and I guess related question, is it easier if we do this in the daily trough instead of at peak? [20:39:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:39:26] In the short / medium term batches of 1 are fine. The only thing is there's definitely a non-linear component to cluster impact, i.e. the act of even one host restarting is still going to cause a good bit of reshuffling (but it should be significantly less than 3) [20:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:49] * legoktm will file a task about adjusting PoolCounter config [20:39:54] Trough is definitely an option, but the other side of that is if we do page and I don't catch the page in time that's closer to when you europeans will actually be sleeping [20:40:36] 10SRE, 10Platform Engineering, 10Traffic, 10Patch-For-Review, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10MdsShakil) Done. L... [20:40:37] don't look at me, I'm in San Francisco ;) but point taken [20:40:57] we should have pretty decent SRE coverage across the day though, if needed [20:42:00] s/you europeans/our europeans :P [20:42:45] I suspect that one host at a time together with the new tuning won't lead to the worker starvation issue even during peak, but I would not bet the farm on it certainly [20:43:02] nod [20:43:14] definitely reasonable to try changing one thing at a time [20:43:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
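The rough worker math quoted above can be checked with a few lines of arithmetic. The three figures come from the log and are approximate; the worst-case framing (every search pool slot and queue position filled at once) and the `worst_case` helper are assumptions made for this sketch.

```python
IDLE_WORKERS = 3000  # "~3k idle PHP-FPM workers" (approximate, from the log)
OLD_CAP = 1910       # workers that could wait on ES or the queue, old limits
NEW_CAP = 1035       # same figure with the new limits


def worst_case(idle, cap):
    """Return (tied_up, still_idle) if the search pools fill completely."""
    tied_up = min(cap, idle)
    return tied_up, idle - tied_up


for label, cap in (("old", OLD_CAP), ("new", NEW_CAP)):
    tied_up, still_idle = worst_case(IDLE_WORKERS, cap)
    print(f"{label} limits: {tied_up} tied up, {still_idle} still idle, "
          f"more idle than tied up: {still_idle > tied_up}")
```

With these inputs the old limits leave fewer idle workers than tied-up ones in the worst case, while the new limits leave more idle than tied up.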
[20:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:56] I think the API cluster will still feel some impact, but will have more idle workers than active ones if I did my math right [20:44:53] Right [20:45:05] Well some mild impact but not enough to page should be okay I'd imagine [20:45:32] As for the restarts, I think it's really a choice between 1 host at a time starting in an hour, or the standard 3 hosts at a time during trough [20:45:48] in the short term I think so [20:45:56] Since 1-at-a-time during the trough would probably not finish before I'd have to sign off for the night [20:46:41] so while the Europeans are awake? ;-) [20:47:07] legoktm: Point taken :P [20:49:33] 10SRE, 10Discovery, 10Sustainability (Incident Followup): Adjust CirrusSearch PoolCounter limits - https://phabricator.wikimedia.org/T296224 (10Legoktm) [20:51:34] There actually are some other knobs we have to fiddle with. We have something called the sane-itizer that makes sure that elasticsearch state lines up with what's in the app DBs, and that process is responsible for quite a bit of load itself [20:51:47] So turning that off should improve things a good bit [20:52:28] nice [20:52:37] I've been saying we could use a little less sane around here [20:52:43] :P [20:57:42] (03PS1) 10Papaul: Add ganeti200[18] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/740679 (https://phabricator.wikimedia.org/T294139) [21:00:05] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211122T2100) [21:01:38] (03PS2) 10Papaul: Add ganeti200[78] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/740679 (https://phabricator.wikimedia.org/T294139) [21:04:30] (03CR) 10Papaul: [C: 03+2] Add ganeti200[78] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/740679 (https://phabricator.wikimedia.org/T294139) (owner: 10Papaul) [21:07:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2027.codfw.wmnet with OS buster [21:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ganeti2027.codfw.wmnet with OS buster [21:08:46] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10Papaul) [21:25:20] 10SRE, 10Discovery, 10Wikimedia-Site-requests, 10Sustainability (Incident Followup): Adjust CirrusSearch PoolCounter limits - https://phabricator.wikimedia.org/T296224 (10Reedy) [21:28:54] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10KSiebert) @Jelto Is there any update on this? We would need this one and this one https://phabricator.wikimedia.org/T296161 before the holiday. All the best and th... [21:36:12] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10Urbanecm) >>! In T295777#7521767, @KSiebert wrote: > @Jelto @RhinosF1 Is there any update on this? We would need this one and this one https://phabricator.wikimed... 
[21:36:59] (03PS1) 10Cwhite: profile: turn off grafana db sync ahead of 8.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/740682 (https://phabricator.wikimedia.org/T282863) [21:37:31] 10SRE, 10Community-Tech, 10LDAP-Access-Requests: Grant Access to Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10Urbanecm) Noting https://ldap.toolforge.org/user/samwilson is already in the wmf group, so @Samwilson should be able to get past the login screen at superset.wikimedia.org an... [21:38:32] (03PS1) 10Razzi: superset: make webserver timeout 3 minutes [puppet] - 10https://gerrit.wikimedia.org/r/740683 [21:39:49] (03PS2) 10Razzi: superset: make webserver timeout 3 minutes [puppet] - 10https://gerrit.wikimedia.org/r/740683 (https://phabricator.wikimedia.org/T294771) [21:40:33] (03CR) 10jerkins-bot: [V: 04-1] superset: make webserver timeout 3 minutes [puppet] - 10https://gerrit.wikimedia.org/r/740683 (https://phabricator.wikimedia.org/T294771) (owner: 10Razzi) [21:42:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2027.codfw.wmnet with OS buster [21:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:50] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2027.codfw.wmnet with OS buster completed: - ganeti2027 (**PASS**)... 
[21:51:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2028.codfw.wmnet with OS buster [21:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:06] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ganeti2028.codfw.wmnet with OS buster [21:53:07] RECOVERY - Check systemd state on elastic2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:55] 10SRE, 10Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (10LClightcat) >>! 在T294676#7520792中,@Legoktm写道: > Sorry about delays - I was going to create the mailing lists just now and realized they don't fit our guidelines for names (http... [22:00:05] Reedy and sbassett: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211122T2200). [22:20:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2028.codfw.wmnet with OS buster [22:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:39] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2028.codfw.wmnet with OS buster completed: - ganeti2028 (**PASS**)... 
[22:24:17] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10Papaul) [22:24:39] 10SRE, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 (10Papaul) 05Open→03Resolved @MoritzMuehlenhoff all yours [22:26:37] (03PS1) 10Nray: Restore ReadingDepth instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740613 (https://phabricator.wikimedia.org/T294777) [22:32:57] 10SRE, 10ops-eqiad, 10serviceops-radar: mw1448.mgmt alert - https://phabricator.wikimedia.org/T296041 (10Jclark-ctr) 05Open→03Resolved Replaced mgmt cable now has link [22:38:45] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10KSiebert) I do have superset access already but can not see all data some is private and my NDA needs to be attached I assume. For example, I would like to see thi... [22:39:03] 10SRE, 10ops-eqiad, 10serviceops-radar: mw1448.mgmt alert - https://phabricator.wikimedia.org/T296041 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [22:43:41] 10SRE, 10Community-Tech, 10LDAP-Access-Requests: Grant Access to Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10KSiebert) Hey there @MMandere and @Urbanecm, he needs access to private data as well. 
[22:44:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2009.codfw.wmnet with OS stretch [22:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:39] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2009.codfw.wmnet with OS stretch [22:45:28] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10Urbanecm) In this case, this looks to be a request for https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Dashboards_in_Superset_/_Hive_interfaces_(like_Hue... [22:50:14] 10SRE, 10Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (10Legoktm) [22:51:57] (03PS1) 10Zabe: Revert "Localisation updates from https://translatewiki.net." 
[core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740688 (https://phabricator.wikimedia.org/T296203) [22:53:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2010.codfw.wmnet with OS stretch [22:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:58] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2010.codfw.wmnet with OS stretch [22:54:41] (03CR) 10Clare Ming: [C: 03+1] Enable reading depth instrumentation at low sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740667 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [22:56:30] 10SRE, 10Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (10Legoktm) 05Open→03Resolved Both lists have been created, please update the description and any other settings as necessary (or re-open if something is totally wrong) * htt... [22:58:31] (03PS2) 10Zabe: Revert "Localisation updates from https://translatewiki.net." 
[core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740688 (https://phabricator.wikimedia.org/T296203) [23:00:26] (03PS1) 10Zabe: Revert "Create redirect Special Pages for delete and protect action" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740689 (https://phabricator.wikimedia.org/T295611) [23:01:14] (03PS2) 10Zabe: Revert "Create redirect Special Pages for delete and protect action" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740689 (https://phabricator.wikimedia.org/T295611) [23:02:38] (03PS2) 10Dzahn: gitlab: create role for prod gitlab-runners, adjust cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/740670 (https://phabricator.wikimedia.org/T295481) [23:03:16] (03CR) 10Dzahn: [C: 03+2] gitlab: create role for prod gitlab-runners, adjust cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/740670 (https://phabricator.wikimedia.org/T295481) (owner: 10Dzahn) [23:05:52] (03PS3) 10Zabe: Revert "Create redirect Special Pages for delete and protect action" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740689 (https://phabricator.wikimedia.org/T295611) [23:11:56] RECOVERY - DNS on mw1448.mgmt is OK: DNS OK: 0.017 seconds response time. 
mw1448.mgmt.eqiad.wmnet returns 10.65.1.26 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:12:16] (03CR) 10Dzahn: "created new role in https://gerrit.wikimedia.org/r/c/operations/puppet/+/740670" [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [23:14:51] (03PS2) 10Dzahn: site and install_server: add gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [23:16:43] (03CR) 10jerkins-bot: [V: 04-1] site and install_server: add gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [23:17:49] (03PS1) 10Nray: Update access_method value in reading depth instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740690 (https://phabricator.wikimedia.org/T294777) [23:18:16] 10SRE, 10Community-Tech, 10LDAP-Access-Requests: Grant Access to Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10Samwilson) Sorry! I totally meant to add that. Thanks. 
[23:18:24] (03CR) 10Dzahn: site and install_server: add gitlab-runner1001 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [23:18:53] 10SRE, 10Community-Tech, 10LDAP-Access-Requests: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10Samwilson) [23:18:55] (03CR) 10Dzahn: "jenkins will be happy once the MAC address is updated" [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [23:18:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2009.codfw.wmnet with OS stretch [23:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:00] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2009.codfw.wmnet with OS stretch completed: - ms-fe2009 (*... 
[23:20:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2011.codfw.wmnet with OS stretch [23:20:30] (03PS1) 10Dzahn: site: use gitlab_runner role on gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740691 (https://phabricator.wikimedia.org/T295481) [23:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:36] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2011.codfw.wmnet with OS stretch [23:22:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe2010.codfw.wmnet with OS stretch [23:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:11] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2010.codfw.wmnet with OS stretch executed with errors: - m... 
[23:22:29] (03CR) 10Dzahn: [C: 03+2] cache::text: remove config for scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/739660 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [23:24:38] (03PS4) 10Nray: Enable reading depth instrumentation at low sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740667 (https://phabricator.wikimedia.org/T294777) [23:25:06] (03CR) 10Dzahn: "cp1079 - alternate_domains.add("\Qscholarships.wikimedia.org\E"); Scheduling refresh of Exec[load-new-vcl-file-frontend] -" [puppet] - 10https://gerrit.wikimedia.org/r/739660 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [23:26:10] (03CR) 10Clare Ming: [C: 03+1] Restore ReadingDepth instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740613 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [23:26:23] (03CR) 10Clare Ming: [C: 03+1] Update access_method value in reading depth instrument [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740690 (https://phabricator.wikimedia.org/T294777) (owner: 10Nray) [23:28:35] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10Papaul) [23:31:25] (03PS2) 10Dzahn: deployment_server: remove scholarships [puppet] - 10https://gerrit.wikimedia.org/r/739663 [23:33:08] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/32556/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/739663 (owner: 10Dzahn) [23:35:59] (03PS2) 10Dzahn: wikimania_scholarships: delete module and profile, remove from miscweb [puppet] - 10https://gerrit.wikimedia.org/r/739658 (https://phabricator.wikimedia.org/T243037) [23:39:49] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/32559/" [puppet] - 10https://gerrit.wikimedia.org/r/739658 (https://phabricator.wikimedia.org/T243037) (owner: 
10Dzahn) [23:40:57] (03CR) 10Dzahn: [V: 03+1 C: 03+2] acme_chief: convert cron to restart service to timer [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:46:33] (03CR) 10Dzahn: "acmechief2001 & acmechief-test2001: noop. acmechief1001 & acmechief-test1001: cron job removed by puppet, new units created." [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:48:42] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe2011.codfw.wmnet with OS stretch [23:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:47] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe20[09-12] - https://phabricator.wikimedia.org/T294136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2011.codfw.wmnet with OS stretch executed with errors: - m... 
[23:50:09] (03PS1) 10Dzahn: acme_chief: new timer needs to use '1h' not 'hourly' [puppet] - 10https://gerrit.wikimedia.org/r/740692 (https://phabricator.wikimedia.org/T273673) [23:51:02] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:51:13] (03CR) 10Dzahn: [C: 03+2] acme_chief: new timer needs to use '1h' not 'hourly' [puppet] - 10https://gerrit.wikimedia.org/r/740692 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:52:04] (03CR) 10Dzahn: "needed follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/740692" [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:53:18] (03CR) 10Dzahn: "Nov 22 23:42:05 acmechief1001 systemd[1]: /lib/systemd/system/reload-acme-chief-backend.timer:6: Failed to parse timer value, ignoring: ho" [puppet] - 10https://gerrit.wikimedia.org/r/740692 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:54:08] !log acmechief1001, acmechief-test1001: sudo systemctl start reload-acme-chief-backend.timer [23:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:46] !log acmechief1001, acmechief-test1001: sudo systemctl restart reload-acme-chief-backend.timer [23:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:03] (03PS1) 10Dzahn: acme_chief: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/740693 (https://phabricator.wikimedia.org/T273673) [23:58:57] (03CR) 10Dzahn: snapshot: replace the word cron everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn)