[00:00:04] RoanKattouw and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T0000). [00:00:04] No Gerrit patches in the queue for this window AFAICS. [00:18:07] PROBLEM - dump of s1 in codfw on alert1001 is CRITICAL: dump for s1 at codfw taken more than 8 days ago: Most recent backup 2021-11-16 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:25:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2012.codfw.wmnet with OS buster [00:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:59] SRE, ops-codfw, DC-Ops, Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2012.codfw.wmnet with OS buster completed:... [00:28:00] SRE, ops-codfw, DC-Ops, Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (Papaul) [00:28:23] PROBLEM - dump of s1 in eqiad on alert1001 is CRITICAL: dump for s1 at eqiad taken more than 8 days ago: Most recent backup 2021-11-16 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [00:34:12] SRE, ops-codfw, DC-Ops, Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (Papaul) Open→Resolved complete [00:34:26] SRE, SRE-swift-storage, ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (Papaul) p:Triage→Medium [04:00:37] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (11742) = 91.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [04:14:07] RECOVERY - Check systemd state on ms-fe2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:19] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [04:20:45] PROBLEM - Check systemd state on ms-fe2010 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:43:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [04:45:25] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:46:13] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [04:53:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [04:57:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [04:59:37] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:08:23] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (11742) = 91.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:12:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [05:17:07] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:34:30] (PS1) Krinkle: alertmanager: Update address for perf-team alerts [puppet] - https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) [05:34:52] (CR) jerkins-bot: [V: -1] alertmanager: Update address for perf-team alerts [puppet] - https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: Krinkle) [05:37:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [05:37:08] (PS2) Krinkle: alertmanager: Update address for perf-team alerts [puppet] - https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) [05:47:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: After optimize table (T296143)', diff saved to https://phabricator.wikimedia.org/P17804 and previous config saved to /var/cache/conftool/dbconfig/20211124-054718-root.json [05:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:23] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [05:48:39] PROBLEM - dump of m1 in codfw on alert1001 is CRITICAL: dump for m1 at codfw taken more than 8 days ago: Most recent backup 2021-11-16 05:19:48 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [05:58:07] (PS2) Marostegui: dbproxy10{17,21}: Change m5 standby host [puppet] - https://gerrit.wikimedia.org/r/740839 (https://phabricator.wikimedia.org/T288720) [05:58:51] (CR) Marostegui: [C: +2] dbproxy10{17,21}: Change m5 standby host [puppet] - https://gerrit.wikimedia.org/r/740839 (https://phabricator.wikimedia.org/T288720) (owner: Marostegui) [06:00:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [06:02:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: After optimize table (T296143)', diff saved to https://phabricator.wikimedia.org/P17805 and previous config saved to /var/cache/conftool/dbconfig/20211124-060221-root.json [06:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:26] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [06:03:12] (PS1) Marostegui: wmnet: Restore TTL back to 5M for m5-master [dns] - https://gerrit.wikimedia.org/r/740964 (https://phabricator.wikimedia.org/T288720) [06:04:08] (CR) Marostegui: [C: +2] wmnet: Restore TTL back to 5M for m5-master [dns] - https://gerrit.wikimedia.org/r/740964 (https://phabricator.wikimedia.org/T288720) (owner: Marostegui) [06:05:54] !log Upgrade db1128's kernel T288720 [06:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:58] T288720: Failover m5 master (db1128) to db1132 to upgrade its kernel - https://phabricator.wikimedia.org/T288720 [06:17:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: After optimize table (T296143)', diff saved to https://phabricator.wikimedia.org/P17806 and previous config saved to 
/var/cache/conftool/dbconfig/20211124-061725-root.json [06:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:30] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [06:19:38] Puppet, DBA, Data-Engineering, Infrastructure-Foundations, Patch-For-Review: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (Marostegui) p:Triage→Medium [06:19:49] SRE, DBA, Platform Engineering, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Marostegui) p:Triage→Medium [06:28:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1065-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [06:32:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: After optimize table (T296143)', diff saved to https://phabricator.wikimedia.org/P17807 and previous config saved to /var/cache/conftool/dbconfig/20211124-063228-root.json [06:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:34] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [06:38:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1065-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [06:45:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance (T296143) [06:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 5:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance (T296143) [06:45:07] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [06:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:17] !log running optimize table with replication on db1155:3314 (T296143) [06:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:42] Since I just started a schema change, I go afk for a while [07:06:03] (PS1) Marostegui: db1128: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/740967 (https://phabricator.wikimedia.org/T295965) [07:07:25] (CR) Marostegui: [C: +2] db1128: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/740967 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui) [07:12:01] !log drop /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8 and other blockmgr-* dirs on stat1006 to free space on the root partition [07:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:59] ACKNOWLEDGEMENT - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T296300 [07:18:51] SRE, SRE-swift-storage, ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (elukey) The host went down again, I acked the alert and didn't reboot it :) [07:22:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [07:23:09] !log reboot kubernetes1018 (role::insetup) to verify negotiated speed of eth interface [07:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 
is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [07:28:29] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32594/console" [puppet] - https://gerrit.wikimedia.org/r/740763 (https://phabricator.wikimedia.org/T289224) (owner: Giuseppe Lavagetto) [07:29:24] (PS2) Giuseppe Lavagetto: hieradata: Route search.wm.o to apple-search [puppet] - https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) (owner: Majavah) [07:29:58] kubernetes1018 seems not coming up from the reboot, nice [07:30:16] ah no it was only super slow, let's see [07:30:40] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32595/console" [puppet] - https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) (owner: Majavah) [07:34:07] (CR) Giuseppe Lavagetto: [V: +1 C: -1] "We don't need to add search.wm.org to the alternate domains." 
[puppet] - https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) (owner: Majavah) [07:40:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1125.eqiad.wmnet with OS bullseye [07:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:22] (PS3) Majavah: hieradata: Route search.wm.o to apple-search [puppet] - https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) [08:04:38] (CR) Majavah: hieradata: Route search.wm.o to apple-search (1 comment) [puppet] - https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) (owner: Majavah) [08:05:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1125.eqiad.wmnet with OS bullseye [08:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:37] (CR) Giuseppe Lavagetto: [V: +1 C: +2] trafficserver: rule for search.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/740763 (https://phabricator.wikimedia.org/T289224) (owner: Giuseppe Lavagetto) [08:14:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [08:15:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [08:16:44] (PS1) Majavah: Remove search.wikimedia.org from appservers [puppet] - https://gerrit.wikimedia.org/r/741079 (https://phabricator.wikimedia.org/T289224) [08:17:35] <_joe_> majavah: hold your horses :D [08:18:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [08:22:45] RECOVERY - Check systemd state on ms-fe2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:00] (PS1) Muehlenhoff: Extend access for bumeh-ctr [puppet] - https://gerrit.wikimedia.org/r/741080 [08:25:28] (PS2) Muehlenhoff: Extend access for bumeh-ctr [puppet] - https://gerrit.wikimedia.org/r/741080 [08:25:55] PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:00] (CR) MMandere: [C: +2] admin: Add samwilson to analytic privatedata group [puppet] - https://gerrit.wikimedia.org/r/740826 (https://phabricator.wikimedia.org/T296161) (owner: MMandere) [08:27:25] (PS3) Muehlenhoff: Extend access for bumeh-ctr [puppet] - https://gerrit.wikimedia.org/r/741080 [08:30:13] (CR) Muehlenhoff: [C: +2] Extend access for bumeh-ctr [puppet] - https://gerrit.wikimedia.org/r/741080 (owner: Muehlenhoff) [08:31:52] SRE, SRE-Access-Requests, Community-Tech, Patch-For-Review: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (MMandere) Open→Resolved a:MMandere @Samwilson you now should be able to access the private data. Please let us know if you face any ch... [08:33:09] _joe_: it's broken :( [08:34:14] <_joe_> majavah: uh what do you mean? 
[08:34:37] <_joe_> how did you even end up in codfw [08:34:55] <_joe_> that is, indeed, the only server that should be pointing to apple search [08:35:02] <_joe_> but it doesn't in my tests [08:35:03] testing from a VPS [08:35:29] <_joe_> connect failed, is also quite peculiar [08:36:02] I can also reproduce locally, "curl -k -H "Host: search.wikimedia.org" https://text-lb.codfw.wikimedia.org/huoh" gives a 502 with that [08:37:04] <_joe_> oh wait [08:37:08] <_joe_> that's not a valid request [08:37:36] it's just something to bypass caching, but it probably should not give a 502 [08:37:43] <_joe_> uhm [08:37:46] <_joe_> yeah that's strange [08:37:51] <_joe_> very strange [08:38:00] <_joe_> btw [08:38:06] <_joe_> I just ran on a single backend [08:38:17] <_joe_> so I don't get why all requests seem to be funneled through it [08:38:54] <_joe_> could not connect [CONNECTION_ERROR] to 10.2.1.68 for 'https://apple-search.discovery.wmnet:4013/?search=test' [08:39:03] <_joe_> well if I use curl from the same server [08:39:04] <_joe_> it works [08:39:59] <_joe_> so, no idea what's wrong there [08:40:55] (PS1) Ladsgroup: Set actor migration to write both on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/741082 (https://phabricator.wikimedia.org/T275246) [08:41:35] !log depool cp2027 [08:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:30] SRE, SRE-Access-Requests, Community-Tech, Patch-For-Review: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (Samwilson) Thanks! I'm still getting an error when I try to view FROM `event.visualeditorfeatureuse`: > Permission denied: user=samwilson, acces... 
[08:48:04] (PS1) Vgutierrez: cp5006: Disable PROXY protocol [puppet] - https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005) [08:49:01] jouncebot: nowandnext [08:49:01] No deployments scheduled for the next 3 hour(s) and 10 minute(s) [08:49:01] In 3 hour(s) and 10 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1200) [08:49:05] cool [08:49:09] (CR) Ladsgroup: [C: +2] Set actor migration to write both on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/741082 (https://phabricator.wikimedia.org/T275246) (owner: Ladsgroup) [08:49:53] (Merged) jenkins-bot: Set actor migration to write both on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/741082 (https://phabricator.wikimedia.org/T275246) (owner: Ladsgroup) [08:51:28] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apple-search' for release 'main' . [08:51:29] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:741082|Set actor migration to write both on all wikis (T275246)]] (duration: 00m 57s) [08:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:34] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [08:52:30] Just a short reminder: we will start re-deploy services in codfw Kubernetes cluster soon. Feel free to ping me any time. 
[08:53:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [08:54:38] (PS1) Muehlenhoff: Add Cumin alias for wcqs hosts [puppet] - https://gerrit.wikimedia.org/r/741084 [08:55:17] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apple-search' for release 'main' . [08:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:03] <_joe_> !log repooling cp2027 [08:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [08:59:07] <_joe_> majavah: fixed [08:59:26] yeah, and I see my curls in logstash [08:59:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[08:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [09:01:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM deneb.codfw.wmnet [09:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [09:04:31] !log start re-deploy procedure in codfw Kubernetes T251305 [09:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:35] T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 [09:07:40] <_joe_> jelto: if you're depooling all services, remember apple-search :P [09:08:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [09:08:21] <_joe_> !log switching search.wikimedia.org to be served by the apple-search servcie [09:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:51] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:09:47] joe: I added apple-search to the list recently ;) [09:10:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM deneb.codfw.wmnet [09:10:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:57] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:11:03] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on apertium.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:11:04] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on apertium.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:06] T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 [09:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:15] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (MoritzMuehlenhoff) [09:12:34] (CR) Filippo Giunchedi: [C: +1] alertmanager: Update address for perf-team alerts [puppet] - https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: Krinkle) [09:12:43] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:09] (CR) Filippo Giunchedi: [C: +1] "I'm not familiar with all the different bits e.g. 
if they require a restart but can merge the patch" [puppet] - https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: Krinkle) [09:13:14] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on api-gateway.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:15] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on api-gateway.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:19] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on apple-search.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:20] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on apple-search.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:25] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on blubberoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:26] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on blubberoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:30] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on citoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:32] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on citoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:35] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime 
for 3:00:00 on cxserver.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:36] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cxserver.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:40] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on echostore.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:41] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on echostore.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:44] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventgate-analytics.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:46] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventgate-analytics.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:48] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventgate-analytics-external.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:50] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventgate-analytics-external.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:53] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventgate-logging-external.svc.codfw.wmnet with reason: helm3 de-deploy T251305 
[09:13:55] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventgate-logging-external.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:58] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventgate-main.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:59] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventgate-main.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:02] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventstreams.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:04] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventstreams.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:06] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventstreams-internal.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:08] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventstreams-internal.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:11] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on linkrecommendation.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:12] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on linkrecommendation.svc.codfw.wmnet with reason: helm3 de-deploy T251305 
[09:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:15] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mathoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:16] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mathoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:20] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mobileapps.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:21] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mobileapps.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:24] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on proton.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:26] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on proton.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:28] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on push-notifications.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:30] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on push-notifications.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:33] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on recommendation-api.svc.codfw.wmnet with reason: helm3 
de-deploy T251305 [09:14:35] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on recommendation-api.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:38] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on sessionstore.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:40] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on sessionstore.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:42] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on shellbox.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:44] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on shellbox.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:47] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on shellbox-constraints.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:48] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on shellbox-constraints.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:51] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on shellbox-media.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:53] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on shellbox-media.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:54] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:55] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on shellbox-syntaxhighlight.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:57] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on shellbox-syntaxhighlight.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:00] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on shellbox-timeline.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:02] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on shellbox-timeline.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:04] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on similar-users.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:05] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on similar-users.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:08] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on tegola-vector-tiles.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:10] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on tegola-vector-tiles.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:13] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 
termbox.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:14] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on termbox.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:17] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on wikifeeds.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:18] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wikifeeds.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:21] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on zotero.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:22] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on zotero.svc.codfw.wmnet with reason: helm3 de-deploy T251305 [09:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:00] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [09:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:07] T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 [09:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:44] jelto: pro-tip you can use multiple names for a single call to the downtime cookbook [09:17:19] ;) [09:17:44] volands: thanks, I will try that the next time :) sorry for the spam [09:18:23] marostegui: cumin cumin [09:18:26] * elukey runs away [09:18:29] :-P [09:18:53] haha [09:19:41] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [09:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM failoid2002.codfw.wmnet [09:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:21] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:10] <_joe_> elukey: please put the cumin you bought in the spicerack, near the nextbox. Thanks. 
[09:22:14] <_joe_> *netbox [09:22:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM failoid2002.codfw.wmnet [09:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:44] (03PS2) 10Vgutierrez: cp5006: Disable PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005) [09:24:36] !log jelto@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=(apertium|api-gateway|blubberoid|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventstreams|eventstreams-internal|linkrecommendation|mathoid|mobileapps|proton|push-notifications|recommendation-api|sessionstore|shellbox|shellbox-constraints|shellbox-media|shellbox-syntaxhighlight|she [09:24:36] llbox-timeline|similar-users|tegola-vector-tiles|termbox|wikifeeds|zotero) [09:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:03] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32597/console" [puppet] - 10https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:26:30] <_joe_> jelto: I don't see apple-search there [09:27:25] apple-search is not pooled in codfw currently.. 
so I did not touch apple-search confctl [09:27:26] {"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=apple-search"} [09:28:47] it's not pooled in eqiad either _joe_ [09:28:58] <_joe_> oh right [09:29:11] <_joe_> well if it's depooled on both sides, it is treated as pooled in both [09:29:15] <_joe_> so let me pool eqiad [09:29:22] oh...TIL [09:30:03] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=apple-search,name=eqiad [09:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM planet2002.codfw.wmnet [09:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:26] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10MMandere) [09:31:45] (03PS3) 10Vgutierrez: cp5006: Disable PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005) [09:32:42] _joe_: so IIUC there is a fallback in pybal so that if both DCs are pooled=false it treats both of them as if they were pooled? [09:32:57] <_joe_> jayme: pybal has nothing to do with this [09:33:00] <_joe_> it's the dns [09:33:07] <_joe_> for a/a services, it does [09:33:08] gdns, sorry [09:33:16] <_joe_> for a/p services, it sends you to failoid IIRC [09:34:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM planet2002.codfw.wmnet [09:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:23] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10MMandere) @nskaggs please help approve Taavi's request. 
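[editor's note] The DNS discovery fallback _joe_ describes in the 09:29 to 09:33 exchange can be sketched roughly as below. This is only an illustration of the rule as stated in the conversation, not the actual DNS server or conftool logic: the function name `resolve_targets`, the `FAILOID` sentinel and the dict layout are hypothetical, and the active/passive branch rests on an "IIRC" in the chat.

```python
from typing import Dict, List

# Hypothetical sentinel standing in for the "failoid" target mentioned in the chat.
FAILOID = "failoid"


def resolve_targets(pooled: Dict[str, bool], active_active: bool) -> List[str]:
    """Return the datacenters a discovery record would point at.

    pooled maps a datacenter name to its pooled flag, e.g.
    {"codfw": False, "eqiad": True}.
    """
    live = sorted(dc for dc, is_pooled in pooled.items() if is_pooled)
    if live:
        # At least one side is pooled: only pooled sides receive traffic.
        return live
    if active_active:
        # Depooled on both sides: per the chat, treated as pooled in both.
        return sorted(pooled)
    # Active/passive with nothing pooled: per the chat ("IIRC"), goes to failoid.
    return [FAILOID]


if __name__ == "__main__":
    # apple-search before the fix: depooled in both datacenters.
    print(resolve_targets({"codfw": False, "eqiad": False}, active_active=True))
    # After pooling eqiad, only eqiad is returned.
    print(resolve_targets({"codfw": False, "eqiad": True}, active_active=True))
```

Under this reading, depooling codfw alone has no effect while eqiad is also depooled, which is why eqiad was pooled first at 09:30:03.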
[09:35:07] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32598/console" [puppet] - 10https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:37:30] Any SREer around to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/714068 for me (Beta Cluster change)? [09:40:20] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [09:41:09] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10Urbanecm) This has my support. Majavah is very helpful, and this level of access would definitely let them to be even more helpful :-). [09:41:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM mx2001.wikimedia.org [09:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:08] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [09:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:52] (03PS1) 10Elukey: WIP - kserve-inference: add support for local tls proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/741092 [09:45:19] !log depool cp5006 - T290005 [09:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:23] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:45:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM mx2001.wikimedia.org [09:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install2003.wikimedia.org [09:46:27] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [09:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:41] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cp5006: Disable PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:48:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [09:49:33] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet [09:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install2003.wikimedia.org [09:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:39] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' . 
[09:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet [09:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [09:53:42] !log restart varnish/haproxy on cp5006 - T290005 [09:53:42] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet [09:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:45] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:25] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [09:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:51] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [09:55:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM debmonitor2002.codfw.wmnet [09:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet [09:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:35] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[09:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM debmonitor2002.codfw.wmnet [09:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:29] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - apple-search_4013: Servers kubernetes2010.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:58:51] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apple-search' for release 'main' . [09:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:03] ah, great - this is you jelto ^ [09:59:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "+1 for the Pontoon bits, thank you Majavah" [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [09:59:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM puppetboard2001.codfw.wmnet [09:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:28] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet [10:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:37] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:01:04] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10elukey) [10:01:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM 
puppetboard2001.codfw.wmnet [10:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:04] !log repool cp5006 - T290005 [10:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:08] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:02:38] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet [10:02:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM puppetboard2002.codfw.wmnet [10:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:59] 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Ladsgroup) First of all, Sorry it took me so long to comment. Vacation, onboarding, etc. I was involved in the work of collapsing a... [10:04:02] (03PS1) 10Inductiveload: enwikisource: enable anonymous talk page mobile tabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741097 (https://phabricator.wikimedia.org/T54165) [10:06:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetboard2002.codfw.wmnet [10:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:33] 10SRE, 10vm-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10aborrero) 05Open→03Resolved [10:06:42] (03PS2) 10Inductiveload: enwikisource: enable anonymous talk page mobile tabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741097 (https://phabricator.wikimedia.org/T47955) [10:06:55] !log downtime PyBal backends health check for helm3 de-deploy T251305. 
I'm keeping an eye on Icinga and will remove the downtime as soon as I'm finished [10:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:58] T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 [10:07:44] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [10:08:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [10:10:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [10:12:01] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [10:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:43] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:02] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:09] 10SRE, 10ops-eqiad, 10serviceops: Kubernetes1018's eth negotiated speed is 10MB/s - https://phabricator.wikimedia.org/T296369 (10ayounsi) That looks like a faulty cable or interface, over to DCops for troubleshooting, let us know if you need Netops help. 
[10:17:56] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:54] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [10:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:29] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:51] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:45] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:10] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [10:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:15] !log disable ping-offload for codfw - T294119 [10:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:18] T294119: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 [10:25:28] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [10:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:59] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'production' . 
[10:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_echostore_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:28:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ping2001.codfw.wmnet [10:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:05] ^ thats me, redeploying echostore [10:30:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:30:43] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10MMandere) p:05Triage→03Medium [10:32:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ping2001.codfw.wmnet [10:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:44] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [10:33:44] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [10:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:53] (03PS1) 10Arturo Borrero Gonzalez: sre.hosts.upgrade-and-reboot: update reference to IcingaHost [cookbooks] - 10https://gerrit.wikimedia.org/r/741100 [10:36:07] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[10:36:07] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [10:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:10] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10MMandere) 05Resolved→03Open @Samwilson, checking I'll advise once done. [10:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:11] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [10:38:11] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [10:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM people2002.codfw.wmnet [10:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:06] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10Samwilson) 05Open→03Resolved @MMandere don't worry, it's working now! :-) thanks! [10:40:15] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [10:40:15] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[10:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM people2002.codfw.wmnet [10:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:10] (03CR) 10Volans: [C: 03+1] "LGTM, thanks." [cookbooks] - 10https://gerrit.wikimedia.org/r/741100 (owner: 10Arturo Borrero Gonzalez) [10:42:18] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [10:42:18] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [10:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventstreams_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:44:20] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [10:44:29] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [10:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:31] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10MMandere) Great, you're welcome! Is there something else you did for it to start working? [10:46:43] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . 
[10:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:00] (03CR) 10Volans: "Possible typos, not 100% sure." [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff) [10:47:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:47:42] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [10:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM xhgui2001.codfw.wmnet [10:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:02] !log rollback: disable ping-offload for codfw - T294119 [10:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:05] T294119: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 [10:49:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM xhgui2001.codfw.wmnet [10:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [10:50:06] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [10:50:06] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[10:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sre.hosts.upgrade-and-reboot: update reference to IcingaHost [cookbooks] - 10https://gerrit.wikimedia.org/r/741100 (owner: 10Arturo Borrero Gonzalez) [10:51:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM webperf2001.codfw.wmnet [10:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:22] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' . [10:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:44] (03CR) 10Giuseppe Lavagetto: profile::mediawiki::php: support kubernetes in php-fatal-error.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [10:53:30] (03Merged) 10jenkins-bot: sre.hosts.upgrade-and-reboot: update reference to IcingaHost [cookbooks] - 10https://gerrit.wikimedia.org/r/741100 (owner: 10Arturo Borrero Gonzalez) [10:53:39] (03CR) 10Arturo Borrero Gonzalez: sre.hosts.upgrade-and-reboot: update reference to IcingaHost (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/741100 (owner: 10Arturo Borrero Gonzalez) [10:53:52] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [10:53:53] (03PS6) 10Giuseppe Lavagetto: profile::mediawiki::php: support kubernetes in php-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) [10:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:22] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10Samwilson) No, I don't think so. 
I did try logging out and in again, but the fix came some time after that. [10:55:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM webperf2001.codfw.wmnet [10:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:58] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10MMandere) @Samwilson understood :) [11:01:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::php: support kubernetes in php-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [11:02:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM webperf2002.codfw.wmnet [11:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [11:03:14] (03PS1) 10JMeybohm: Add read-only access for jayme [homer/public] - 10https://gerrit.wikimedia.org/r/741108 [11:05:27] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[11:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:06:29] ^ that's me [11:07:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM webperf2002.codfw.wmnet [11:07:03] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10MoritzMuehlenhoff) @Samwilson : It seems related to Puppet (our configuration management system) run times. Your update that it was still failing happened 21 minutes af... [11:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:08:26] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:37] <_joe_> jelto: can I deploy mwdebug to codfw or should it wait? [11:09:00] joe: I'll deploy it in ~3 min if that works for you.
It's next in the list [11:09:20] <_joe_> sure go on yourself then :) [11:10:01] (03PS1) 10Vgutierrez: varnish: Fix UDS check cmd [puppet] - 10https://gerrit.wikimedia.org/r/741109 (https://phabricator.wikimedia.org/T290005) [11:10:27] (03PS1) 10Jbond: wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110 [11:10:29] (03PS1) 10Jbond: R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111 [11:10:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [11:11:11] (03PS2) 10Jbond: R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111 [11:11:13] (03CR) 10Ayounsi: [C: 03+1] netbox - cas: allow users with active=False [software/netbox] - 10https://gerrit.wikimedia.org/r/739309 (https://phabricator.wikimedia.org/T295148) (owner: 10Volans) [11:11:22] (03CR) 10jerkins-bot: [V: 04-1] wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110 (owner: 10Jbond) [11:12:54] (03CR) 10jerkins-bot: [V: 04-1] R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111 (owner: 10Jbond) [11:12:59] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32600/console" [puppet] - 10https://gerrit.wikimedia.org/r/741109 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:13:01] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:25] (03CR) 10jerkins-bot: [V: 04-1] R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111 (owner: 10Jbond) [11:13:25] <_joe_> jelto: are you going to also recreate the pods? [11:13:30] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [11:13:37] PROBLEM - LVS mwdebug codfw port 4444/tcp - mwdebug- mwdebug.svc.codfw.wmnet IPv4 on mwdebug.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.59 and port 4444: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:13:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter2003.codfw.wmnet [11:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:49] <_joe_> I guess you are :D [11:15:13] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:43] RECOVERY - LVS mwdebug codfw port 4444/tcp - mwdebug- mwdebug.svc.codfw.wmnet IPv4 on mwdebug.svc.codfw.wmnet is OK: OK - Certificate appservers-rw.discovery.wmnet will expire on Mon 06 Jul 2026 02:13:19 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:15:43] (03PS3) 10Jbond: R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111 [11:16:15] joe: yes the process recreates pods. Sorry forgot to downtime mwdebug. I think we have the same as with apple-search here. Are you using the pyball fallback that its pooled anyway? 
then the service might not be reachable the last ~5 minutes [11:17:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter2003.codfw.wmnet [11:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:43] (03CR) 10jerkins-bot: [V: 04-1] R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111 (owner: 10Jbond) [11:18:03] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [11:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:55] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: add ceph packages in the octopus/bullseye combo. [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) [11:18:57] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [11:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter2004.codfw.wmnet [11:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [11:21:10] (03PS4) 10Jbond: R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111 [11:21:22] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Kormat) > Maybe we need to revisit the alerting for hosts if they start to send false alerts often. @Ladsgroup: I'm not following, why would a networking problem be a 'false' alert? [11:21:34] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[11:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:50] (03PS5) 10Jbond: R:uwsgi::app:Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111 [11:22:06] (03PS6) 10Jbond: R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111 [11:22:52] _joe_: I'm a bit concerned that the LogstashKafkaConsumerLag alert could be related to my re-deploy. Is this something I should take a look at now? [11:23:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [11:23:18] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [11:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:24] <_joe_> jelto: mostly to the messages being ingested by logstash I would say [11:23:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter2004.codfw.wmnet [11:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:39] <_joe_> godog: do you have better suggestions?
[11:23:53] <_joe_> re: understanding what's causing the surge in logging [11:24:19] (03PS1) 10Jbond: R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114 [11:24:25] _joe_ jelto taking a look [11:25:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1141.eqiad.wmnet with reason: Maintenance T296143 [11:25:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1141.eqiad.wmnet with reason: Maintenance T296143 [11:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:14] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [11:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:18] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) [11:25:20] (03PS7) 10Jbond: R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111 [11:25:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T296143)', diff saved to https://phabricator.wikimedia.org/P17808 and previous config saved to /var/cache/conftool/dbconfig/20211124-112539-ladsgroup.json [11:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:58] (03PS2) 10Jbond: wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110 [11:26:18] oh yeah that's been active for a while heh [11:26:21] that == the alert [11:26:25] (03PS2) 10Jbond: R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114 [11:26:40] (03PS8) 10Jbond: R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111 [11:26:42] (03PS1) 10Majavah: Remove search.wikimedia.org files [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/741115 (https://phabricator.wikimedia.org/T289224) [11:26:59] (03CR) 10jerkins-bot: [V: 04-1] wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110 (owner: 10Jbond) [11:27:29] (03PS9) 10Jbond: R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111 [11:27:33] jelto: lag started at 6 UTC, was also that when you began your activities ? [11:27:49] !log optimizing image.commonswiki in db1141 (T296143) [11:27:49] godog: no I started at 9 UTC today [11:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:54] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: add ceph packages in the octopus/bullseye combo [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) [11:28:10] (03PS10) 10Jbond: R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111 [11:28:22] jelto: yeah it must be something else, I'm taking a look anyways though [11:28:33] godog: great thanks a lot [11:29:07] RECOVERY - dump of m1 in codfw on alert1001 is OK: Last dump for m1 at codfw (db2078.codfw.wmnet:3321) taken on 2021-11-24 10:03:10 (31 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [11:29:14] (03CR) 10David Caro: aptrepo: add ceph packages in the octopus/bullseye combo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez) [11:30:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32602/console" [puppet] - 10https://gerrit.wikimedia.org/r/741111 (owner: 10Jbond) [11:32:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - 
https://alerts.wikimedia.org [11:32:44] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [11:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:54] (03PS1) 10Jbond: rubocop: exclude lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/741117 [11:33:13] (03CR) 10Jbond: [C: 03+2] rubocop: exclude lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/741117 (owner: 10Jbond) [11:33:29] (03CR) 10Jbond: [V: 03+1 C: 03+2] R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111 (owner: 10Jbond) [11:33:47] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) After deploying the changes to php-fatal-error.php, we can now see the error messages delivered by php-wmerrors in logstash. [11:34:12] (03PS3) 10Jbond: wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110 [11:34:21] (03PS3) 10Jbond: R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114 [11:35:09] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [11:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:25] !log bounce apache2 on logstash1025 [11:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:18] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[11:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:36] (03CR) 10Muehlenhoff: aptrepo: add ceph packages in the octopus/bullseye combo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez) [11:37:50] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish: Fix UDS check cmd [puppet] - 10https://gerrit.wikimedia.org/r/741109 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:37:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM poolcounter2004.codfw.wmnet [11:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:51] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [11:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:40] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox' for release 'main' . [11:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:13] RECOVERY - dump of s1 in codfw on alert1001 is OK: Last dump for s1 at codfw (db2141.codfw.wmnet:3311) taken on 2021-11-24 09:53:28 (162 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [11:41:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM poolcounter2004.codfw.wmnet [11:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:55] (03PS1) 10Jbond: P:puppetboard::ng: Add nrpe check_http command [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304) [11:42:04] (03PS1) 10Giuseppe Lavagetto: httpbb: move tests for search.wikimedia.org to apple-search [puppet] - 10https://gerrit.wikimedia.org/r/741119 [11:42:45] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . 
[11:42:45] (03PS4) 10Jbond: wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110 [11:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM poolcounter2003.codfw.wmnet [11:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:48] (03PS4) 10Jbond: R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114 [11:44:24] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [11:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:40] (03PS2) 10Jbond: P:puppetboard::ng: Add nrpe check_http command [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304) [11:45:09] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [11:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32604/console" [puppet] - 10https://gerrit.wikimedia.org/r/741114 (owner: 10Jbond) [11:45:57] (03CR) 10Jbond: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/741117 (owner: 10Jbond) [11:45:57] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [11:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:32] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' . 
[11:48:33] (03PS3) 10Jbond: P:puppetboard::ng: Add nrpe check_http command [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304) [11:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:22] !log systemctl reset-failed ifup@ens5.service on poolcounter2003 T273026 [11:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:25] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [11:50:44] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . [11:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM poolcounter2003.codfw.wmnet [11:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:36] (03PS3) 10Arturo Borrero Gonzalez: aptrepo: add ceph packages in the octopus/bullseye combo [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) [11:52:43] (03CR) 10Arturo Borrero Gonzalez: aptrepo: add ceph packages in the octopus/bullseye combo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez) [11:53:02] (03PS5) 10Jbond: R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114 [11:53:12] (03PS4) 10Jbond: P:puppetboard::ng: Add nrpe check_http command [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304) [11:53:13] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[11:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez) [11:54:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32607/console" [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304) (owner: 10Jbond) [11:54:24] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [11:54:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM rpki2002.codfw.wmnet [11:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:17] 10SRE-swift-storage: Media storage metadata inconsistent with Swift - https://phabricator.wikimedia.org/T289996 (10jcrespo) [11:56:27] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[11:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:29] (03CR) 10David Caro: [C: 03+1] aptrepo: add ceph packages in the octopus/bullseye combo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez) [11:58:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM rpki2002.codfw.wmnet [11:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:12] (03PS3) 10WMDE-Fisch: VisualEditor template dialog: new sidebar and inline descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: 10Awight) [11:58:15] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . [11:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:58:49] ^ thats maybe me, however I have to take a look what routinator is [11:59:02] (03CR) 10Jbond: [C: 03+2] wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110 (owner: 10Jbond) [11:59:05] (03CR) 10Jbond: [C: 03+2] R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114 (owner: 10Jbond) [11:59:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netbox-dev2001.wikimedia.org [11:59:09] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppetboard::ng: Add nrpe check_http command [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304) (owner: 10Jbond) [11:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] 
Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1200). Please do the needful. [12:00:04] awight: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:09] I can deploy my patches :-) [12:00:12] ok :) [12:00:47] jelto: it's not you, it's moritzm's restart of rpki2002 [12:00:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [12:01:04] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . [12:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:13] majavah: great thanks! [12:02:05] (03CR) 10Awight: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: 10Awight) [12:02:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:02:46] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[12:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netbox-dev2001.wikimedia.org [12:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netbox2001.wikimedia.org [12:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:20] (03Merged) 10jenkins-bot: VisualEditor template dialog: new sidebar and inline descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: 10Awight) [12:03:43] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [12:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:19] 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Lucas_Werkmeister_WMDE) Would it be better on Commons if we set `$wgWBClientSettings['entityUsageModifierLimits']['C']` to 1 instead... [12:07:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netbox2001.wikimedia.org [12:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:51] (03CR) 10Btullis: [C: 03+2] analytics:refinery:job:refine_sanitize: Fix refine_monitor offsets [puppet] - 10https://gerrit.wikimedia.org/r/740931 (owner: 10Mforns) [12:10:05] !log awight@deploy1002 Synchronized wmf-config: Config: [[gerrit:740766|VisualEditor template dialog: new sidebar and inline descriptions (T284203, T286992)]] (duration: 00m 57s) [12:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:13] T286992: Deploy VE template dialog improvements to small set of wikis - https://phabricator.wikimedia.org/T286992 [12:10:13] T284203: Deploy inline descriptions, extended sidebar and bigger dialog to small set of wikis - https://phabricator.wikimedia.org/T284203 [12:10:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netboxdb2001.codfw.wmnet [12:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:15] (03PS2) 10Awight: [lint] fully-qualify classname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737193 [12:12:21] (03CR) 10Awight: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737193 (owner: 10Awight) [12:12:45] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:13:09] (03Merged) 10jenkins-bot: [lint] fully-qualify classname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737193 (owner: 10Awight) [12:13:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netboxdb2001.codfw.wmnet [12:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:50] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:15:18] (03PS1) 10Jelto: helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/741124 [12:15:56] (03PS2) 10Awight: Replace global with parent scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737195 [12:16:06] (03CR) 10Awight: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737195 (owner: 10Awight) [12:16:31] !log awight@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:737193|[lint] fully-qualify classname]] (duration: 00m 55s) [12:16:32] (03PS1) 10Jbond: puppetboard - service: update puppetboard live check [puppet] - 10https://gerrit.wikimedia.org/r/741146 [12:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:46] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:16:55] (03CR) 10jerkins-bot: [V: 04-1] helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/741124 (owner: 10Jelto) [12:16:59] 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on hywiki - https://phabricator.wikimedia.org/T296382 (10Lucas_Werkmeister_WMDE) [12:17:06] 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on warwiki - https://phabricator.wikimedia.org/T296383 (10Lucas_Werkmeister_WMDE) [12:17:16] 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on cebwiki - https://phabricator.wikimedia.org/T296384 (10Lucas_Werkmeister_WMDE) [12:17:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 
2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32608/console" [puppet] - 10https://gerrit.wikimedia.org/r/741146 (owner: 10Jbond) [12:18:04] (03Merged) 10jenkins-bot: Replace global with parent scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737195 (owner: 10Awight) [12:18:54] (03CR) 10Btullis: [C: 03+2] Add more alerts to the data-engineering team [alerts] - 10https://gerrit.wikimedia.org/r/735669 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [12:19:42] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [12:19:44] (03CR) 10Jbond: [V: 03+1] puppetboard - service: update puppetboard live check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741146 (owner: 10Jbond) [12:20:25] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:54] 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Ladsgroup) >>! In T188730#7526191, @Lucas_Werkmeister_WMDE wrote: > Would it be better on Commons if we set `$wgWBClientSettings['en... [12:21:02] (03Merged) 10jenkins-bot: Add more alerts to the data-engineering team [alerts] - 10https://gerrit.wikimedia.org/r/735669 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [12:21:12] 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on hywiki - https://phabricator.wikimedia.org/T296382 (10Lucas_Werkmeister_WMDE) 2021-11-24: `lang=shell lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql hywiki <<< 'SELEC... 
[12:21:41] 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on warwiki - https://phabricator.wikimedia.org/T296383 (10Lucas_Werkmeister_WMDE) 2021-11-24: `lang=shell lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql warwiki <<< 'SE... [12:21:45] !log awight@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:737195|Replace global with parent scope]] (duration: 00m 55s) [12:21:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM releases2002.codfw.wmnet [12:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:00] 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on cebwiki - https://phabricator.wikimedia.org/T296384 (10Lucas_Werkmeister_WMDE) 2021-11-24: `lang=shell lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql cebwiki <<< 'SEL... 
[12:22:25] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes2016.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled: mwdebug_4444: Servers kubernetes2004.codfw.wmnet, k [12:22:25] s2012.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:22:59] ^that's me, miscweb needs some extra care and downtime was a bit short [12:23:08] !log EU scap deployment finished [12:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:23:52] (03PS1) 10Jelto: miscweb: fix whitespace for affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/741149 [12:24:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM releases2002.codfw.wmnet [12:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:34] (03CR) 10jerkins-bot: [V: 04-1] miscweb: fix whitespace for affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/741149 (owner: 10Jelto) [12:25:34] !log
jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM urldownloader2001.wikimedia.org [12:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM urldownloader2001.wikimedia.org [12:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:36] (03CR) 10Btullis: superset: set webserver timeout to 180 seconds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740712 (https://phabricator.wikimedia.org/T294771) (owner: 10Razzi) [12:29:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM urldownloader2002.wikimedia.org [12:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:52] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [12:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:50] (03CR) 10Jbond: [C: 04-1] "need to update the following reference:" [puppet] - 10https://gerrit.wikimedia.org/r/740903 (owner: 10Dzahn) [12:32:09] (03CR) 10Jbond: [C: 03+2] public_cloud: Add public_clouds_shutdown to global config [puppet] - 10https://gerrit.wikimedia.org/r/740545 (owner: 10Jbond) [12:32:18] (03PS1) 10Jcrespo: argparams: Test edge cases [puppet] - 10https://gerrit.wikimedia.org/r/741152 [12:32:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM urldownloader2002.wikimedia.org [12:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:51] (03PS2) 10Jcrespo: argparse: Test edge cases [puppet] - 10https://gerrit.wikimedia.org/r/741152 [12:33:10] (03Abandoned) 10Jbond: WIP: do not merge - CR to test varnish changes [puppet] - 10https://gerrit.wikimedia.org/r/740842 (owner: 10Jbond) [12:33:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:24] (03PS5) 10Jbond: R:varnish:instance: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) [12:33:32] (03PS3) 10Jcrespo: argparse: Test edge cases [puppet] - 10https://gerrit.wikimedia.org/r/741152 [12:33:34] (03PS10) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) [12:35:11] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [12:36:17] 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Ladsgroup) Two ideas for improving the current design: - Normalize the table based on eu_aspect. - While this would have been som... [12:36:55] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [12:36:58] (03PS1) 10MMandere: admin: Add user taavi to wmcs and labtest group [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) [12:37:12] !log disable puppet for puppetdb reboot [12:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:41] PROBLEM - LVS mwdebug codfw port 4444/tcp - mwdebug- mwdebug.svc.codfw.wmnet IPv4 on mwdebug.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.59 and port 4444: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:39:20] 10SRE, 10SRE-swift-storage, 10MediaWiki-extensions-Score: upload.wikimedia.org does not set content-encoding headers for Score-generated lilypond files - https://phabricator.wikimedia.org/T287326 (10TheDJ) 05Open→03Resolved a:03TheDJ [12:41:49] (03CR) 
10Ssingh: [C: 03+1] "+1, uid matches and so do the groups (shell access)." [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere) [12:43:58] !log jbond@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM puppetdb2002.codfw.wmnet [12:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1141 (T296143)', diff saved to https://phabricator.wikimedia.org/P17809 and previous config saved to /var/cache/conftool/dbconfig/20211124-124420-ladsgroup.json [12:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:24] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [12:44:32] 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Lucas_Werkmeister_WMDE) > * Split the table to wbc_property_usage and wbc_item_usage and use numeric ids there. > * I don't know w... 
[12:45:53] (03PS2) 10Muehlenhoff: Add Cumin alias for wcqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/741084 [12:46:08] !log jelto@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=(apertium|api-gateway|apple-search|blubberoid|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventstreams|eventstreams-internal|linkrecommendation|mathoid|mobileapps|proton|push-notifications|recommendation-api|sessionstore|shellbox|shellbox-constraints|shellbox-media|shellbox-syntaxh [12:46:08] ighlight|shellbox-timeline|similar-users|tegola-vector-tiles|termbox|wikifeeds|zotero) [12:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:13] (03CR) 10Muehlenhoff: Add Cumin alias for wcqs hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff) [12:47:43] !log jbond@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetdb2002.codfw.wmnet [12:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:20] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:33] !log enable puppet post puppetdb reboot [12:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:09] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes2007.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:50:32] 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Ladsgroup) Yeah, then possibly split that column into two, one numeric id and numeric identifier of the entity type (item=0, propert... [12:50:39] RECOVERY - LVS mwdebug codfw port 4444/tcp - mwdebug- mwdebug.svc.codfw.wmnet IPv4 on mwdebug.svc.codfw.wmnet is OK: OK - Certificate appservers-rw.discovery.wmnet will expire on Mon 06 Jul 2026 02:13:19 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:51:23] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:51:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM grafana2001.codfw.wmnet [12:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:13] (03CR) 10Jelto: [C: 03+2] hiera::role::common::deployment_server update helmBinary codfw [puppet] - 10https://gerrit.wikimedia.org/r/736822 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [12:53:21] (03PS2) 10Jelto: hiera::role::common::deployment_server update helmBinary codfw [puppet] - 10https://gerrit.wikimedia.org/r/736822 (https://phabricator.wikimedia.org/T251305) [12:53:24] 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on cebwiki - https://phabricator.wikimedia.org/T296384 (10Marostegui) [12:53:33] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:53:52] 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on warwiki - https://phabricator.wikimedia.org/T296383 (10Marostegui) [12:53:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM grafana2001.codfw.wmnet [12:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:38] (03PS1) 10Klausman: site: Move non-vm ML machines in codfw to setup for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/741154 (https://phabricator.wikimedia.org/T294412) [12:54:40] 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on hywiki - https://phabricator.wikimedia.org/T296382 (10Marostegui) [12:54:56] !log 
jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM apt2001.wikimedia.org [12:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:24] 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Marostegui) [12:58:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [12:59:51] 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Ladsgroup) >>! In T295952#7526103, @Kormat wrote: >> Maybe we need to revisit the alerting for hosts if they start to send false alerts often. > > @Ladsgroup: I'm not following, why would a networ... [13:00:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM apt2001.wikimedia.org [13:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1141 (T296143)', diff saved to https://phabricator.wikimedia.org/P17810 and previous config saved to /var/cache/conftool/dbconfig/20211124-130200-ladsgroup.json [13:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:04] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [13:04:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:37] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:06] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [13:07:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:23] (03PS2) 10Muehlenhoff: Point irc.wikimedia.org to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/740864 (https://phabricator.wikimedia.org/T294119) [13:07:27] PROBLEM - SSH on kubernetes1003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:10:50] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff) [13:12:53] 10Puppet, 10Infrastructure-Foundations, 10Testing-Roadblocks, 10User-jbond: Allow using WMCS hiera lookup order in Puppet rspec tests - https://phabricator.wikimedia.org/T296327 (10jbond) @Majavah I have had a think about this and i don't think that it will work as expected. Currently the shared spec help... 
[13:13:28] (03CR) 10Gehel: [C: 03+1] Add Cumin alias for wcqs hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff) [13:15:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1141 (T296143)', diff saved to https://phabricator.wikimedia.org/P17811 and previous config saved to /var/cache/conftool/dbconfig/20211124-131519-ladsgroup.json [13:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:24] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [13:15:32] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/740913 (owner: 10Majavah) [13:17:50] (03CR) 10Muehlenhoff: Add Cumin alias for wcqs hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff) [13:19:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:22:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-corp2001.wikimedia.org [13:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:34] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [13:25:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-corp2001.wikimedia.org [13:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:16] (03CR) 10Muehlenhoff: [C: 03+2] Point irc.wikimedia.org to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/740864 (https://phabricator.wikimedia.org/T294119) (owner: 10Muehlenhoff) [13:27:27] !log filippo@cumin1001 
START - Cookbook sre.ganeti.reboot-vm for VM logstash2004.codfw.wmnet [13:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:29] !log filippo@cumin1001 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM logstash2004.codfw.wmnet [13:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:51] !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2004.codfw.wmnet [13:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica2005.wikimedia.org [13:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:11] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2004.codfw.wmnet [13:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica2005.wikimedia.org [13:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:05] (03CR) 10Ayounsi: [C: 03+2] Add read-only access for jayme [homer/public] - 10https://gerrit.wikimedia.org/r/741108 (owner: 10JMeybohm) [13:31:41] (03Merged) 10jenkins-bot: Add read-only access for jayme [homer/public] - 10https://gerrit.wikimedia.org/r/741108 (owner: 10JMeybohm) [13:33:07] (03PS4) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 [13:33:57] (03CR) 10jerkins-bot: [V: 04-1] argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [13:34:44] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [13:34:46] !log jmm@cumin2002 START - 
Cookbook sre.ganeti.reboot-vm for VM ldap-replica2006.wikimedia.org [13:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:53] (03CR) 10Jcrespo: "I am not sure unit tests are running by default locally or remotelly, but the patch works when tested specifically:" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [13:35:21] !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2005.codfw.wmnet [13:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:20] !log add Jayme r/o user to all network devices [13:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1141 (T296143)', diff saved to https://phabricator.wikimedia.org/P17812 and previous config saved to /var/cache/conftool/dbconfig/20211124-133628-ladsgroup.json [13:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:33] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [13:37:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica2006.wikimedia.org [13:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:37] I'm about to use a script to depool db1142 automatically, if it misbehaves, don't worry [13:37:51] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2005.codfw.wmnet [13:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T296143)', diff saved to https://phabricator.wikimedia.org/P17813 and previous config saved to /var/cache/conftool/dbconfig/20211124-133809-ladsgroup.json [13:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:01] 
(BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:39:21] !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2006.codfw.wmnet [13:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1142.eqiad.wmnet with reason: Maintenance T296143 [13:39:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1142.eqiad.wmnet with reason: Maintenance T296143 [13:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:56] (03PS5) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 [13:41:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:41:14] (03PS6) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 [13:41:35] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2006.codfw.wmnet [13:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:43] (03PS7) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 [13:43:12] (03CR) 10Jbond: argparse: Fix number of parameters when String argument contains spaces (033 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [13:43:53] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff) [13:44:57] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [13:49:26] !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2023.codfw.wmnet [13:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:39] (03PS8) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 [13:50:44] (03PS2) 10Klausman: site: Move non-vm ML machines in codfw to setup for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/741154 (https://phabricator.wikimedia.org/T294412) [13:51:18] (03CR) 10Jcrespo: "Should we escape also existing double quotes? 'lol"lol' => "lol\"lol" ?" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [13:51:23] (03CR) 10Jbond: "LGTM but will need nskaggs approval" [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere) [13:52:31] (03PS3) 10Klausman: site: Move non-vm ML machines in codfw to setup for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/741154 (https://phabricator.wikimedia.org/T294412) [13:52:40] (03CR) 10Kormat: "I can't speak for the release/yaml stuff, but the rest LGTM. 2 minor comments." [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [13:53:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10jbond) >>! In T296192#7525855, @Urbanecm wrote: > This has my support. Majavah is very helpful, and this level of access would definitel... 
[13:54:22] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [13:54:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM serpens.wikimedia.org [13:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:07] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2023.codfw.wmnet [13:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:28] (03PS1) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) [13:56:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [13:56:36] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:57:38] (03PS2) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) [13:58:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM serpens.wikimedia.org [13:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:11] (03CR) 10Muehlenhoff: "Let's also add an approval: line for those two groups and set it to Nicholas, please." 
[puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere) [14:00:49] !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2024.codfw.wmnet [14:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:04:56] (03PS9) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 [14:05:12] (03CR) 10Jcrespo: "Another take-- let me know what you thing." [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [14:06:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:06:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM idp-test2001.wikimedia.org [14:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:54] (03Abandoned) 10Klausman: site: Move non-vm ML machines in codfw to setup for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/741154 (https://phabricator.wikimedia.org/T294412) (owner: 10Klausman) [14:08:19] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) The re-deploy of codfw was successful. Some take-aways are added here which came up in the codfw migration. The plan to migrate eqiad Kubernetes to `helm3`: * Announce maintenanc... 
[14:08:34] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [14:09:25] (03CR) 10Jbond: argparse: Fix number of parameters when String argument contains spaces (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [14:09:30] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:10:27] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2024.codfw.wmnet [14:10:29] (03PS1) 10Jelto: hiera::role::common::deployment_server update helmBinary eqiad [puppet] - 10https://gerrit.wikimedia.org/r/741681 (https://phabricator.wikimedia.org/T251305) [14:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test2001.wikimedia.org [14:10:34] PROBLEM - Check systemd state on logstash2024 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:44] (03PS3) 10Jcrespo: mariadb: Split the dbstore_multiinstance role into two others [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) [14:14:37] (03CR) 10Jcrespo: mariadb: Split the dbstore_multiinstance role into two others (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [14:15:29] !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2025.codfw.wmnet [14:15:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:16] (03CR) 10Jbond: "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [14:16:52] (03CR) 10Jbond: argparse: Fix number of parameters when String argument contains spaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [14:19:17] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:19:18] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:28] (03CR) 10Muehlenhoff: [C: 03+1] "+1 on the profile::contacts::role_contacts/Cumin alias changes and one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [14:20:53] (03CR) 10Jcrespo: "Doing most of that- although it is getting confusing." 
[puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [14:21:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:21:12] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2025.codfw.wmnet [14:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM idp2001.wikimedia.org [14:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:42] (03CR) 10Jcrespo: [C: 03+1] mariadb: Split the dbstore_multiinstance role into two others (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [14:24:44] (03CR) 10Jbond: argparse: Fix number of parameters when String argument contains spaces (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [14:25:21] (03CR) 10Muehlenhoff: [C: 03+1] mariadb: Split the dbstore_multiinstance role into two others (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [14:26:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:26:14] !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2030.codfw.wmnet [14:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp2001.wikimedia.org [14:26:51] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:19] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:27:20] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:38] _joe_: good news the Zuul queue overflow alarm no more shows up in this channel / sre :) [14:28:16] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:28:17] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:53] !log systemctl reset-failed ifup@ens5.service on logstash2024 T273026 [14:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:56] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [14:30:13] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:17] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh2001.wikimedia.org [14:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:22] RECOVERY - Check systemd state on logstash2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:12] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[14:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:15] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:21] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Cloud-Services-Origin-Team, and 5 others: Refactor puppet:base module to reduce unneeded shared code paths - https://phabricator.wikimedia.org/T289661 (10dcaro) [14:31:29] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Cloud-Services-Origin-Team, and 5 others: Refactor puppet:base module to reduce unneeded shared code paths - https://phabricator.wikimedia.org/T289661 (10dcaro) [14:31:52] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2030.codfw.wmnet [14:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:57] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:01] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[14:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:08] (03PS3) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) [14:34:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh2001.wikimedia.org [14:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:40] (03PS3) 10Jbond: puppetmaster - hiera: order site after role [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [14:35:57] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [14:36:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh2002.wikimedia.org [14:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:32] (03PS10) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 [14:36:54] !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2031.codfw.wmnet [14:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:23] (03CR) 10Jcrespo: "I think this does what you suggested- please forgive if I missed something :-)" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [14:39:04] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2031.codfw.wmnet [14:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh2002.wikimedia.org [14:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:24] (03CR) 10Btullis: [C: 03+1] "Looks good. 
+1" [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [14:42:48] (03CR) 10Jcrespo: [C: 03+1] "Looking good: https://puppet-compiler.wmflabs.org/compiler1002/32610/" [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [14:44:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum2001.codfw.wmnet [14:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:32] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10fgiunchedi) [14:45:39] (03PS2) 10MMandere: admin: Add user taavi to wmcs and labtest group [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) [14:46:51] (03CR) 10Ssingh: [C: 03+1] "+1, addresses moritzm's comment." [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere) [14:47:20] (03CR) 10MMandere: admin: Add user taavi to wmcs and labtest group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere) [14:48:41] (03CR) 10Jcrespo: [C: 03+1] "I wouldn't touch it for this patch scope, but more than open to change it on a followup patch, suggestions?" [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [14:49:23] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:29] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:37] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[14:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum2001.codfw.wmnet [14:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM search-loader2001.codfw.wmnet [14:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:00] btullis, about to deploy gerrit:740815, I expect noop, but pinging thinking about the worst [14:52:39] I will test it quickly on 2 hosts, revert if something unexpected happens [14:52:43] ack, thanks jynus. [14:52:55] and we can keep talking on the ticket [14:53:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10nskaggs) Yes, this has my support. Thank you! [14:53:12] RECOVERY - Check systemd state on ms-fe2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:22] (03CR) 10Jcrespo: [C: 03+2] mariadb: Split the dbstore_multiinstance role into two others [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [14:54:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM search-loader2001.codfw.wmnet [14:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:18] (03PS11) 10Jbond: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [14:55:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum2002.codfw.wmnet [14:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:02] (03CR) 10Jbond: [C: 03+1] "LGTM, thanks <3" [puppet] - 
10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [14:56:11] (03CR) 10Nskaggs: [C: 03+1] "+1 from me for Taavi. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere) [14:56:21] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:57:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1142 (T296143)', diff saved to https://phabricator.wikimedia.org/P17815 and previous config saved to /var/cache/conftool/dbconfig/20211124-145721-ladsgroup.json [14:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:25] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [14:57:46] (03CR) 10MMandere: [C: 03+2] admin: Add user taavi to wmcs and labtest group [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere) [14:57:49] btullis, all good, only thing that changed was motd and "contacts.yaml" (no idea what that is used for, but all expected) [14:58:05] jynus: Great, many thanks. 
[14:58:12] (03PS12) 10Jbond: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [14:59:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [14:59:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum2002.codfw.wmnet [14:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:42] 10SRE, 10SRE-swift-storage, 10ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (10Papaul) @elukey thanks [14:59:46] PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:47] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:08] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/32609/" [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [15:00:51] 10Puppet, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, 10Patch-For-Review: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (10jcrespo) Deployment went as expected- but now that I thought a bit, I think btull... 
[15:01:37] (03CR) 10Jbond: [C: 03+1] argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [15:01:46] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [15:02:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow2001.codfw.wmnet [15:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:05] (03CR) 10Jcrespo: [C: 03+1] "You were right- sorry, so many micro changes made things confusing. **Please go ahead and deploy at your convenience** if you want! This w" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [15:03:33] (03CR) 10Jcrespo: [C: 04-1] argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [15:03:54] (03CR) 10Jcrespo: [C: 04-1] "wait, I saw a few deprecated comments. I think." [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [15:04:22] (03PS1) 10Jbond: wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686 [15:04:57] (03PS13) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 [15:05:34] (03CR) 10Jcrespo: [C: 03+1] "That should be it." [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [15:05:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow2001.codfw.wmnet [15:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:21] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[15:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:26] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:06:28] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:32] (03PS2) 10Jbond: wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686 [15:07:18] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:07:36] ta daaaan [15:07:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10MMandere) [15:07:55] this is me and Tobias working on the codfw cluster, some issues with calico [15:08:18] (03CR) 10Jbond: [C: 03+2] argparse: Fix number of parameters when String argument contains spaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [15:08:26] (KubernetesCalicoDown) firing: ml-serve-ctrl2002.codfw.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:08:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM gitlab2001.wikimedia.org [15:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:44] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:59] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [15:09:24] 10SRE, 10SRE-swift-storage, 10ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (10Papaul) swapped DIMM B2 with DIMM A4 [15:09:42] RECOVERY - SSH on kubernetes1003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:12:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1142 (T296143)', diff saved to https://phabricator.wikimedia.org/P17817 and previous config saved to /var/cache/conftool/dbconfig/20211124-151226-ladsgroup.json [15:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:30] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [15:12:34] RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.05 ms [15:13:49] (KubernetesCalicoDown) firing: (4) ml-serve-ctrl2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:14:04] PROBLEM - puppet last run on ms-be2058 is 
CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:14:31] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10Majavah) ` taavi@runko ~> ssh cloudcontrol1003.wikimedia.org Linux cloudcontrol1003 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 Debian GNU/Li... [15:14:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM gitlab2001.wikimedia.org [15:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:05] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [15:17:41] (03PS14) 10Jbond: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [15:17:59] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM dragonfly-supernode2001.codfw.wmnet [15:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:05] (03PS3) 10Jbond: wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686 [15:18:24] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10MMandere) 05Open→03Resolved a:03MMandere Thank you too @Majavah for confirming access. 
[15:18:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb: move tests for search.wikimedia.org to apple-search [puppet] - 10https://gerrit.wikimedia.org/r/741119 (owner: 10Giuseppe Lavagetto) [15:20:14] RECOVERY - puppet last run on ms-be2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:20:33] (03CR) 10Jcrespo: [C: 03+1] argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [15:21:02] (03CR) 10jerkins-bot: [V: 04-1] wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686 (owner: 10Jbond) [15:21:40] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dragonfly-supernode2001.codfw.wmnet [15:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:34] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:22:42] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm) [15:23:04] (03CR) 10Jbond: [C: 03+2] argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo) [15:23:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [15:23:26] (KubernetesCalicoDown) firing: (4) ml-serve-ctrl2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:23:41] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM 
kubestagemaster2001.codfw.wmnet [15:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:52] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 104, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:23:57] downtiming nodes [15:24:21] (03PS1) 10Jcrespo: mariadb: Remove :: from profile setup on 2 mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/741689 (https://phabricator.wikimedia.org/T296285) [15:25:56] (03PS1) 10Giuseppe Lavagetto: httpbb: add missing file in previous change [puppet] - 10https://gerrit.wikimedia.org/r/741690 [15:26:27] (03CR) 10Jcrespo: "As promised ;-) https://puppet-compiler.wmflabs.org/compiler1002/32611/" [puppet] - 10https://gerrit.wikimedia.org/r/741689 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [15:26:30] 10SRE, 10ops-ulsfo: Update PDUs name-server config - https://phabricator.wikimedia.org/T295668 (10Papaul) [15:26:42] (03CR) 10jerkins-bot: [V: 04-1] httpbb: add missing file in previous change [puppet] - 10https://gerrit.wikimedia.org/r/741690 (owner: 10Giuseppe Lavagetto) [15:26:44] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpbb: add missing file in previous change [puppet] - 10https://gerrit.wikimedia.org/r/741690 (owner: 10Giuseppe Lavagetto) [15:27:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1142 (T296143)', diff saved to https://phabricator.wikimedia.org/P17819 and previous config saved to /var/cache/conftool/dbconfig/20211124-152731-ladsgroup.json [15:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:36] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [15:27:40] (03PS2) 10Giuseppe Lavagetto: httpbb: add missing file in previous change [puppet] - 10https://gerrit.wikimedia.org/r/741690 [15:27:53] <_joe_> sigh I'm on a roll today [15:28:26] (KubernetesCalicoDown) resolved: (4) ml-serve-ctrl2002.codfw.wmnet:9091 is 
not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:29:05] \o/ [15:30:01] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagemaster2001.codfw.wmnet [15:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:42] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm) [15:31:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb: add missing file in previous change [puppet] - 10https://gerrit.wikimedia.org/r/741690 (owner: 10Giuseppe Lavagetto) [15:31:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM irc2001.wikimedia.org [15:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:25] (03PS4) 10Jbond: wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686 [15:32:40] !log reboot ms-be2058 for firmware upgrade [15:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:16] (03PS2) 10Giuseppe Lavagetto: Remove search.wikimedia.org from appservers [puppet] - 10https://gerrit.wikimedia.org/r/741079 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah) [15:33:46] PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100% [15:34:09] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM schema2003.codfw.wmnet [15:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM irc2001.wikimedia.org [15:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kafkamon2002.codfw.wmnet [15:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:18] !log 
btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM schema2003.codfw.wmnet [15:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:24] RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms [15:38:07] (03CR) 10Hashar: "Not sure why rubocop did not complaint when I have send the original change. Anyway thank you for the follow up!" [puppet] - 10https://gerrit.wikimedia.org/r/741117 (owner: 10Jbond) [15:39:10] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM schema2004.codfw.wmnet [15:39:12] (03PS1) 10Vgutierrez: cache::haproxy: Disable PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/741693 (https://phabricator.wikimedia.org/T290005) [15:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafkamon2002.codfw.wmnet [15:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:39] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [15:41:41] (03CR) 10Kormat: partmon: add reuse partmon profile for cassandra hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [15:42:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1142 (T296143)', diff saved to https://phabricator.wikimedia.org/P17820 and previous config saved to /var/cache/conftool/dbconfig/20211124-154236-ladsgroup.json [15:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:40] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [15:43:16] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very 
high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [15:44:13] (03CR) 10Kormat: [C: 03+1] mariadb: Remove :: from profile setup on 2 mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/741689 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [15:45:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1143.eqiad.wmnet with reason: Maintenance T296143 [15:45:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1143.eqiad.wmnet with reason: Maintenance T296143 [15:45:30] (03CR) 10Jbond: rubocop: exclude lintian-junit-report (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741117 (owner: 10Jbond) [15:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T296143)', diff saved to https://phabricator.wikimedia.org/P17821 and previous config saved to /var/cache/conftool/dbconfig/20211124-154533-ladsgroup.json [15:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:55] (03CR) 10Jcrespo: [C: 03+2] mariadb: Remove :: from profile setup on 2 mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/741689 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo) [15:48:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM schema2004.codfw.wmnet [15:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:24] PROBLEM - Check systemd state on schema2004 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:49] (03PS1) 10Ladsgroup: rdbms: Make TransactionProfiler logs more useful [core] 
(wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741132 (https://phabricator.wikimedia.org/T295706) [15:49:36] jouncebot: nowandnext [15:49:36] No deployments scheduled for the next 3 hour(s) and 10 minute(s) [15:49:36] In 3 hour(s) and 10 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1900) [15:49:36] In 3 hour(s) and 10 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1900) [15:49:41] nice [15:49:48] (03CR) 10Ladsgroup: [C: 03+2] rdbms: Make TransactionProfiler logs more useful [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741132 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup) [15:50:17] (03PS2) 10Vgutierrez: cache::haproxy: Disable PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/741693 (https://phabricator.wikimedia.org/T290005) [15:51:22] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32613/console" [puppet] - 10https://gerrit.wikimedia.org/r/741079 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah) [15:51:47] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] Remove search.wikimedia.org from appservers [puppet] - 10https://gerrit.wikimedia.org/r/741079 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah) [15:52:36] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32614/console" [puppet] - 10https://gerrit.wikimedia.org/r/741693 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:55:08] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Papaul) I received 10 out of 18 hosts. Can someone please update the racking information? 
Thanks [15:55:17] (03CR) 10Jbond: [C: 03+2] wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686 (owner: 10Jbond) [15:59:20] RECOVERY - Check systemd state on schema2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:49] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) [16:00:09] !log systemctl reset-failed ifup@ens5.service on schema2004 T273026 [16:00:10] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:14] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:00:15] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [16:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10aborrero) Moved debate into {T296411} [16:02:53] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff) [16:07:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [16:07:51] (03PS1) 10Ssingh: test_dns: add a DoT check against all doh* hosts [software/knead-wikidough] - 
10https://gerrit.wikimedia.org/r/741698 [16:08:59] (03Merged) 10jenkins-bot: rdbms: Make TransactionProfiler logs more useful [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741132 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup) [16:09:15] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:18] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:20] (03CR) 10Ssingh: [C: 03+2] test_dns: add a DoT check against all doh* hosts [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/741698 (owner: 10Ssingh) [16:13:30] !log start of "foreachwikiindblist s3 migrateRevisionActorTemp.php --sleep=2" in mwmaint1002 in a screen. It will take a month or so (T275246) [16:13:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:34] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [16:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:58] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster2002.codfw.wmnet [16:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [16:19:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [16:19:09] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster2002.codfw.wmnet [16:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:21] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster2001.codfw.wmnet [16:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:58] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm) [16:23:02] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubestagetcd2001.codfw.wmnet [16:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: add handling of php-fpm logs via rsyslogd [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [16:23:35] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster2001.codfw.wmnet [16:23:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:27] !log mforns@deploy1002 Started deploy [analytics/refinery@6253399]: Regular analytics weekly train [analytics/refinery@6253399] [16:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:32] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagetcd2001.codfw.wmnet [16:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:21] (03Merged) 10jenkins-bot: mediawiki: add handling of php-fpm logs via rsyslogd [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [16:29:59] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubestagetcd2003.codfw.wmnet [16:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:13] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm) [16:30:36] (03CR) 10Kormat: partmon: add reuse partmon profile for cassandra hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [16:31:28] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubetcd2004.codfw.wmnet [16:31:29] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagetcd2003.codfw.wmnet [16:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:04] (03PS1) 10Giuseppe Lavagetto: service::catalog: DRY the wikireplicas section [puppet] - 10https://gerrit.wikimedia.org/r/741703 [16:33:00] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[16:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:02] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:39] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubetcd2004.codfw.wmnet [16:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:57] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32615/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [16:33:58] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubestagetcd2002.codfw.wmnet [16:33:59] testing done, moving forward [16:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:51] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32616/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [16:35:04] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm) [16:35:07] PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100% [16:35:08] 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10elukey) Thanks a lot @jbond for all the info, I have other questions/doubts in mind, I think that we are close to find a solution but I feel that some things needs to be discussed first. 1) p12/jks bundles The `... 
[16:35:56] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/includes/libs/rdbms/: Backport: [[gerrit:741132|rdbms: Make TransactionProfiler logs more useful (T295706)]] (duration: 00m 57s) [16:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:00] T295706: Improve TransactionProfiler as replacement for tendril's slow queries - https://phabricator.wikimedia.org/T295706 [16:36:02] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubetcd2006.codfw.wmnet [16:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:18] (03PS2) 10Giuseppe Lavagetto: service::catalog: DRY the wikireplicas section [puppet] - 10https://gerrit.wikimedia.org/r/741703 [16:36:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagetcd2002.codfw.wmnet [16:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:13] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32617/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [16:37:51] (03CR) 10Cwhite: [C: 03+2] upgrade ecs to 1.11.0 [software/ecs] - 10https://gerrit.wikimedia.org/r/735417 (https://phabricator.wikimedia.org/T294581) (owner: 10Cwhite) [16:37:59] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:38:13] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubetcd2006.codfw.wmnet [16:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:29] (03Merged) 10jenkins-bot: upgrade ecs to 1.11.0 [software/ecs] - 10https://gerrit.wikimedia.org/r/735417 
(https://phabricator.wikimedia.org/T294581) (owner: 10Cwhite) [16:40:57] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:00] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:36] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubetcd2005.codfw.wmnet [16:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:55] RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms [16:42:00] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm) [16:42:43] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:53] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:57] (03PS3) 10Giuseppe Lavagetto: service::catalog: DRY the wikireplicas section [puppet] - 10https://gerrit.wikimedia.org/r/741703 [16:43:03] 10SRE, 10SRE-swift-storage, 10ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (10Papaul) Firmware upgrade complete on the server. leaving the server up to see if the error shows on DIMM A4 [16:43:17] (03PS1) 10JHathaway: admin: Add myself (jhathaway) [puppet] - 10https://gerrit.wikimedia.org/r/741705 [16:43:19] (03PS1) 10JHathaway: admin: Add myself(jhathaway) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/741706 [16:43:26] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[16:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:48] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubetcd2005.codfw.wmnet [16:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:00] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:44:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [16:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:27] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32618/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [16:45:30] (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ host wikifunctions.beta.wmflabs.org" [puppet] - 10https://gerrit.wikimedia.org/r/714068 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [16:46:49] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:47:04] (03PS1) 10Cwhite: logstash: deploy ecs 1.11.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/741707 (https://phabricator.wikimedia.org/T294581) [16:47:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:47:20] (03PS3) 10Razzi: superset: set webserver timeout to 180 seconds [puppet] - 10https://gerrit.wikimedia.org/r/740712 (https://phabricator.wikimedia.org/T294771) [16:47:38] (03CR) 10Jobo: [V: 03+2] admin: Add myself(jhathaway) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/741706 (owner: 10JHathaway) [16:48:00] (03CR) 10Dzahn: "oh, thank you for that. one time I was able to deploy just fine, the other times I wasn't and it timed out. as mentioned before it's not t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/741124 (owner: 10Jelto) [16:48:34] (03CR) 10Jobo: [V: 03+2] admin: Add myself (jhathaway) [puppet] - 10https://gerrit.wikimedia.org/r/741705 (owner: 10JHathaway) [16:49:38] (03PS2) 10Cwhite: logstash: deploy ecs 1.11.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/741707 (https://phabricator.wikimedia.org/T294581) [16:49:59] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:14] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[16:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:38] (03PS1) 10Majavah: hieradata: fix beta wikifunction setup [puppet] - 10https://gerrit.wikimedia.org/r/741708 [16:51:17] mutante: James_F: https://gerrit.wikimedia.org/r/c/operations/puppet/+/741708/ [16:51:27] (03PS4) 10Giuseppe Lavagetto: service::catalog: DRY the wikireplicas section [puppet] - 10https://gerrit.wikimedia.org/r/741703 [16:52:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [16:52:49] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32619/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [16:53:22] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "This change is now a NOOP on lvs1016, so I think it should be good to go." 
[puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [16:55:21] (03CR) 10Dzahn: gitlab: restore script keep_config options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth) [16:55:26] (03CR) 10Jbond: [C: 03+2] admin: Add myself (jhathaway) [puppet] - 10https://gerrit.wikimedia.org/r/741705 (owner: 10JHathaway) [16:56:24] (03CR) 10Arturo Borrero Gonzalez: puppetmaster - hiera: order site after role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [16:56:45] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2005.codfw.wmnet [16:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:47] (03CR) 10Jbond: [C: 03+2] admin: Add myself(jhathaway) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/741706 (owner: 10JHathaway) [16:56:57] majavah: WF won't be a multilingual site? [16:57:05] it won't? [16:57:10] (03CR) 10Dzahn: [C: 03+2] hieradata: fix beta wikifunction setup [puppet] - 10https://gerrit.wikimedia.org/r/741708 (owner: 10Majavah) [16:57:21] wait.. :) [16:57:28] Oh, you mean a single site with multiple languages, unlike WP which is multiple sites each with one language? [16:57:30] I thought that it will like wikidata etc [16:57:42] Yeah, we're like Wikidata. [16:57:53] ACK. i will keep merging [16:58:05] But api.wikifunctions.org (and api.wikifunctions.beta.wmflabs.org) will be a non-MediaWiki install; is that OK? 
[16:58:15] or not, because 2 pending merges on master [16:58:18] !log mforns@deploy1002 Finished deploy [analytics/refinery@6253399]: Regular analytics weekly train [analytics/refinery@6253399] (duration: 32m 50s) [16:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:35] !log mforns@deploy1002 Started deploy [analytics/refinery@6253399] (thin): Regular analytics weekly train THIN [analytics/refinery@6253399] [16:58:35] and they are access related.. so.. i'll give it a few [16:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:42] !log mforns@deploy1002 Finished deploy [analytics/refinery@6253399] (thin): Regular analytics weekly train THIN [analytics/refinery@6253399] (duration: 00m 07s) [16:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:54] James_F: yeah, it's fine, it need to be set up separately anyways [16:58:55] !log mforns@deploy1002 Started deploy [analytics/refinery@6253399] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6253399] [16:58:56] mutante, majavah: Thank you both. [16:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:58] majavah: thanks! not merged on master just yet [16:59:41] * James_F nods. [17:00:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:00:20] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Krinkle) @Dzahn Just an idea, but if we create an alias of some... 
[17:00:27] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2005.codfw.wmnet [17:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:51] 10SRE, 10Observability-Logging: Develop tooling for quickly parsing 5xx and sampled-1000 logs - https://phabricator.wikimedia.org/T292682 (10lmata) [17:01:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1143 (T296143)', diff saved to https://phabricator.wikimedia.org/P17826 and previous config saved to /var/cache/conftool/dbconfig/20211124-170100-ladsgroup.json [17:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:04] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [17:01:17] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM registry2003.codfw.wmnet [17:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:48] mutante: happy for me to merge your change [17:01:59] jbond: ok, please do. 
cloud/beta only :) [17:02:16] cooll thanks [17:02:18] majavah: James_F: now [17:02:27] thanks as well [17:02:32] (03PS1) 10Ladsgroup: rdbms: Add full query to transaction profiler [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741134 (https://phabricator.wikimedia.org/T295706) [17:03:04] (03CR) 10Ladsgroup: [C: 03+2] rdbms: Add full query to transaction profiler [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741134 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup) [17:05:28] (03CR) 10Krinkle: alertmanager: Update address for perf-team alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: 10Krinkle) [17:05:39] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry2003.codfw.wmnet [17:05:40] !log mforns@deploy1002 Finished deploy [analytics/refinery@6253399] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6253399] (duration: 06m 45s) [17:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:14] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2006.codfw.wmnet [17:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:54] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM registry2004.codfw.wmnet [17:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:26] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm) [17:07:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10ayounsi) 
[17:08:31] 10SRE, 10Observability-Alerting: Icinga meta monitoring pages during icinga host reboots - https://phabricator.wikimedia.org/T274662 (10lmata) [17:08:45] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw [17:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:51] 10SRE, 10Citoid, 10Observability-Logging, 10Wikimedia-Logstash, and 3 others: Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (10lmata) [17:09:55] (03Abandoned) 10Dzahn: rename base/files/labs to base/files/cloud [puppet] - 10https://gerrit.wikimedia.org/r/740903 (owner: 10Dzahn) [17:10:01] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10lmata) [17:10:31] (03CR) 10Dzahn: [C: 04-1] "it first needs https://phabricator.wikimedia.org/T296331#7525107" [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [17:11:13] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry2004.codfw.wmnet [17:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:25] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2006.codfw.wmnet [17:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:38] James_F: the domain is now configured on the apache side (and I purged the previous different error messages from the caches after getting confused) and requests are now making to mwmultiversion [17:16:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1143 (T296143)', diff saved to https://phabricator.wikimedia.org/P17827 and previous config saved to /var/cache/conftool/dbconfig/20211124-171604-ladsgroup.json [17:16:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:10] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [17:17:00] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM chartmuseum2001.codfw.wmnet [17:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:17] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) @Krinkle sure, always a good idea to replace hardcoded ho... [17:17:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [17:17:36] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10akosiaris) Thanks @papaul. We 'll get back to you! [17:17:59] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2015.codfw.wmnet [17:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:09] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2015.codfw.wmnet [17:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:22] (03CR) 10Razzi: [C: 03+1] "LGTM, would you like to pair on deploying this, Andrew?" 
[puppet] - 10https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: 10Samuel (WMF)) [17:20:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM chartmuseum2001.codfw.wmnet [17:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:40] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2016.codfw.wmnet [17:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:44] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:21:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:48] (03CR) 10Dzahn: [V: 03+1 C: 03+2] cache/text_haproxy: remove scholarships.wikimedia.org config [puppet] - 10https://gerrit.wikimedia.org/r/740907 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [17:22:13] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm) [17:23:00] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw [17:23:00] (03CR) 10Krinkle: "This is uncontroversial to merge as far as I'm concerned. 
I've checked the two hosts via ssh, they're up, have the same role as doc1001, a" [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [17:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [17:23:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2016.codfw.wmnet [17:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:32] (03Merged) 10jenkins-bot: rdbms: Add full query to transaction profiler [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741134 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup) [17:25:18] (03CR) 10Andrew Bogott: [C: 03+2] Delete roles for bare metal WMCS puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/740913 (owner: 10Majavah) [17:25:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:36] (03PS1) 10Jbond: no op change to demo puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/741710 [17:25:38] (03PS1) 10Jbond: no op change to demo puppet-mere [puppet] - 10https://gerrit.wikimedia.org/r/741711 [17:26:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:26:04] end of year = more people start to delete stuff :) [17:26:17] (03Restored) 10Hashar: scap/dsh: add doc1002/doc2001 to ci-docroot hosts [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [17:26:25] (03CR) 10Andrew Bogott: [C: 03+2] puppet_alert: Condider zero resources a failure [puppet] - 10https://gerrit.wikimedia.org/r/740897 (owner: 10Majavah) [17:26:39] bbiaw, afk [17:27:01] (03CR) 10Jbond: [C: 03+2] no op change to demo puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/741710 (owner: 10Jbond) [17:27:07] (03CR) 10Jbond: [C: 03+2] no op change to demo puppet-mere [puppet] - 10https://gerrit.wikimedia.org/r/741711 (owner: 10Jbond) [17:27:09] (03PS1) 10Majavah: P::doc: sync data to non-active servers [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) [17:27:21] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [17:27:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:45] (03PS5) 10Hashar: scap/dsh: add doc1002/doc2001 to ci-docroot hosts [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [17:28:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [17:28:40] (03CR) 10Hashar: "Requested by Timo, we can indeed have integration/docroot deployed to all hosts even if there is little bandwidth now to do the switch." [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [17:28:48] (03CR) 10Hashar: [C: 03+1] scap/dsh: add doc1002/doc2001 to ci-docroot hosts [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [17:29:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1143 (T296143)', diff saved to https://phabricator.wikimedia.org/P17828 and previous config saved to /var/cache/conftool/dbconfig/20211124-173110-ladsgroup.json [17:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:14] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [17:31:17] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:06] (03PS2) 10Majavah: P::doc: sync data to non-active servers [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) [17:33:17] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [17:34:28] !log jhathaway@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=puppetboard [17:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:03] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/includes/libs/rdbms/: Backport: [[gerrit:741134|rdbms: Add full query to transaction profiler (T295706)]] (duration: 00m 56s) [17:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:06] T295706: Improve TransactionProfiler as replacement for tendril's slow queries - https://phabricator.wikimedia.org/T295706 [17:39:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:40:35] 10ops-eqiad, 10DC-Ops: Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (10RobH) [17:40:48] 10ops-eqiad, 10DC-Ops: Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (10RobH) [17:41:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:41:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Seems like a nice refactor. LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [17:41:45] 10ops-eqiad, 10DC-Ops: Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (10RobH) [17:44:20] (03PS3) 10Majavah: P::doc: sync data to non-active servers [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) [17:44:37] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [17:46:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1143 (T296143)', diff saved to https://phabricator.wikimedia.org/P17829 and previous config saved to /var/cache/conftool/dbconfig/20211124-174615-ladsgroup.json [17:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:19] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [17:47:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1144.eqiad.wmnet with reason: Maintenance T296143 [17:47:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1144.eqiad.wmnet with reason: Maintenance T296143 [17:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17830 and previous config saved to /var/cache/conftool/dbconfig/20211124-174723-ladsgroup.json [17:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:48:37] (03CR) 10BryanDavis: [C: 03+1] "See https://yaml.org/type/merge.html and https://ktomk.github.io/writing/yaml-anchor-alias-and-merge-key.html if you are unfamiliar with h" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [17:53:10] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/740712 (https://phabricator.wikimedia.org/T294771) (owner: 10Razzi) [17:54:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:54:11] (03Abandoned) 10Majavah: hieradata: Route search.wm.o to apple-search [puppet] - 10https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah) [17:59:03] (03PS1) 10Majavah: P::doc: use correct php_fpm path [puppet] - 10https://gerrit.wikimedia.org/r/741715 [17:59:40] (03PS2) 10Majavah: P::doc: use correct php_fpm path [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) [18:00:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:01:56] (03CR) 10DCausse: [C: 03+1] cirrussearch: s/sanitizer/saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/740711 (https://phabricator.wikimedia.org/T295705) (owner: 10Ryan Kemper) [18:02:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [18:04:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:09:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:11:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:12:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [18:13:13] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Majavah) >>! In T247653#7527389, @Dzahn wrote: >> should the new... [18:14:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [18:14:07] RECOVERY - dump of s1 in eqiad on alert1001 is OK: Last dump for s1 at eqiad (db1140.eqiad.wmnet:3311) taken on 2021-11-24 09:48:02 (162 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [18:20:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:24:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:30:14] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM acmechief2001.codfw.wmnet [18:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:21] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ops-monitoring-bot) VM acmechief2001.codfw.wmnet rebooted by vgutierrez@cumin1001 with reason: None [18:30:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:34:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [18:35:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:36:30] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM acmechief2001.codfw.wmnet [18:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:55] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM acmechief-test2001.codfw.wmnet [18:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:03] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ops-monitoring-bot) VM acmechief-test2001.codfw.wmnet rebooted by vgutierrez@cumin1001 with reason: None [18:41:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:42:12] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM acmechief-test2001.codfw.wmnet [18:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:30] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ncredir2001.codfw.wmnet [18:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:33] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM ncredir2001.codfw.wmnet [18:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:37] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ops-monitoring-bot) VM ncredir2001.codfw.wmnet rebooted by vgutierrez@cumin1001 with reason: None [18:43:14] !log vgutierrez@cumin1001 START - 
Cookbook sre.ganeti.reboot-vm for VM ncredir2001.codfw.wmnet [18:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:21] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ops-monitoring-bot) VM ncredir2001.codfw.wmnet rebooted by vgutierrez@cumin1001 with reason: None [18:43:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:44:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [18:48:37] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir2001.codfw.wmnet [18:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:02] (03PS1) 10Kosta Harlan: TaskSet: Add ImageRecommendationFilter [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741135 (https://phabricator.wikimedia.org/T295410) [18:51:43] (03Abandoned) 10Kosta Harlan: TaskSet: Add ImageRecommendationFilter [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741135 (https://phabricator.wikimedia.org/T295410) (owner: 10Kosta Harlan) [18:51:54] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ncredir2002.codfw.wmnet [18:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:59] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ops-monitoring-bot) VM ncredir2002.codfw.wmnet rebooted by 
vgutierrez@cumin1001 with reason: None [18:52:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:54:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:57:16] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir2002.codfw.wmnet [18:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:19] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10Vgutierrez) [18:59:06] 10SRE, 10LDAP-Access-Requests: Change LDAP username? - https://phabricator.wikimedia.org/T296429 (10Majavah) 05Open→03Declined Declining as https://wikitech.wikimedia.org/wiki/SRE/LDAP/Renaming_users says that renaming existing users is not possible. You might need to create a new user if you want a userna... [18:59:16] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [19:00:05] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1900) [19:00:05] RoanKattouw and Urbanecm: Your horoscope predicts another unfortunate UTC evening backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1900). [19:00:05] No Gerrit patches in the queue for this window AFAICS. 
[19:01:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:03:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17831 and previous config saved to /var/cache/conftool/dbconfig/20211124-190343-ladsgroup.json [19:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:47] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [19:04:43] (03CR) 10Andrew Bogott: [C: 03+2] maintain-views.yaml: Restrict `localuser` table to prevent disclosure [puppet] - 10https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: 10Samuel (WMF)) [19:04:53] (03PS3) 10Andrew Bogott: maintain-views.yaml: Restrict `localuser` table to prevent disclosure [puppet] - 10https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: 10Samuel (WMF)) [19:05:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:06:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [19:10:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:14:21] 10SRE, 10Platform Engineering: Technical advice on migrating content from Outreach-wiki to Meta-wiki - https://phabricator.wikimedia.org/T296091 (10Ladsgroup) Redirects in foundationwiki work like that: https://foundation.wikimedia.org/w/index.php?title=Legal_talk:New_User_Welcome_Survey_Privacy_Statement/fa&r... [19:14:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:16:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [19:16:38] 10SRE, 10LDAP-Access-Requests: Change LDAP username? - https://phabricator.wikimedia.org/T296429 (10freephile) Thanks anyway @Majavah , and for pointing to the docs. I wasn't sure if it was possible and didn't find the wikitech info previously. 
[19:18:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17832 and previous config saved to /var/cache/conftool/dbconfig/20211124-191847-ladsgroup.json [19:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:52] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [19:19:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [19:19:58] !log run `maintain-views --all-databases --replace-all` on clouddb1013 for T292594 [19:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:21:39] PROBLEM - MariaDB Replica Lag: s8 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1279.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:21:57] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1307.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:24:14] mm [19:25:13] I think there is a missing alert disabling because of backups, fixing [19:25:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:25:50] I will check if it happened on more instances [19:29:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:32:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:33:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17833 and previous config saved to /var/cache/conftool/dbconfig/20211124-193352-ladsgroup.json [19:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:57] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [19:38:12] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) >>! In T247653#7527732, @Majavah wrote: > Stretch is sta... 
[19:38:35] !log `sudo maintain-views --all-databases --replace-all` on clouddb1018 for T292594 [19:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [19:40:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:42:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [19:43:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:48:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17834 and previous config saved to /var/cache/conftool/dbconfig/20211124-194857-ladsgroup.json [19:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:02] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [19:52:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [19:56:08] (03PS4) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) [20:00:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:02:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:10:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [20:14:09] PROBLEM - MariaDB Replica IO: s5 on db1150 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:16:07] ^ only backups [20:18:55] that is not "only backups" that is a network error that shouldn't happen [20:19:22] I meant the host was only ba [20:19:26] Backups [20:19:38] I.e. no user facing impact [20:20:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:20:43] RECOVERY - MariaDB Replica IO: s5 on db1150 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:22:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:22:31] (03PS1) 10Papaul: Add elastic206[1-9]elastic207[0-2] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/741720 (https://phabricator.wikimedia.org/T294154) [20:23:16] there is something bad going on on that host, I will file a task [20:24:19] It's not the only eqiad db to have net issues [20:24:22] I think it is just being network saturated [20:25:16] I will downtime it and check it tomorrow [20:25:23] db1131 had a broken cable a few days ago (https://phabricator.wikimedia.org/T295952) [20:25:25] there is lots of regular packet loss [20:25:29] Ah [20:26:35] it could be hw issues, or just resource starvation [20:30:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [20:31:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:35:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:40:37] (03PS2) 10Legoktm: mediawiki: Remove tidy binary [puppet] - 10https://gerrit.wikimedia.org/r/732386 [20:40:39] (03PS2) 10Legoktm: mediawiki: Remove libvips-tools [puppet] - 10https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802) [20:41:35] (03CR) 10Papaul: [C: 03+2] Add elastic206[1-9]elastic207[0-2] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/741720 (https://phabricator.wikimedia.org/T294154) (owner: 10Papaul) [20:42:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:43:01] (03CR) 10Legoktm: [C: 03+2] mediawiki: Remove tidy binary [puppet] - 10https://gerrit.wikimedia.org/r/732386 (owner: 10Legoktm) [20:43:07] (03CR) 10Legoktm: [C: 03+2] mediawiki: Remove libvips-tools [puppet] - 10https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802) (owner: 10Legoktm) [20:43:54] (03PS3) 10Legoktm: mediawiki: Remove libvips-tools [puppet] - 10https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802) [20:44:00] (03CR) 10Legoktm: [V: 03+2 C: 03+2] mediawiki: Remove libvips-tools [puppet] - 10https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802) (owner: 10Legoktm) [20:44:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:49:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, and 2 others: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10Papaul) [20:50:14] (03PS7) 10Legoktm: Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 [20:50:50] 10SRE, 10serviceops, 10Patch-For-Review: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (10Legoktm) 05Open→03Resolved a:03Legoktm [20:50:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:51:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2061.codfw.wmnet with OS buster [20:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, and 2 others: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2061.codfw.wmnet with OS buster [20:51:57] (03PS2) 10Legoktm: Set $wgMaxImageArea = false; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725101 (https://phabricator.wikimedia.org/T291014) [20:52:07] (03CR) 10Legoktm: [C: 03+2] Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 (owner: 10Legoktm) [20:52:51] (03Merged) 10jenkins-bot: Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 (owner: 10Legoktm) [20:53:02] (03Abandoned) 10Legoktm: service: 
Enable paging for shellbox-constraints service [puppet] - 10https://gerrit.wikimedia.org/r/711737 (owner: 10Legoktm) [20:53:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:54:30] !log legoktm@deploy1002 Synchronized wmf-config/: Update configuration related to disabling Score functionality (duration: 00m 57s) [20:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:58] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Update README for purpose of this repository, remove unused fonts [mediawiki-config/fonts] - 10https://gerrit.wikimedia.org/r/732792 (owner: 10Legoktm) [20:58:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [20:58:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:52] (03CR) 10Legoktm: "I have fixes for about half of Effie's review sitting locally, I'm wondering if it would be easier to first have a bash or Python script t" [cookbooks] - 10https://gerrit.wikimedia.org/r/727605 (owner: 10Legoktm) [21:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T2100). 
[21:00:33] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Majavah) [21:00:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:18] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Majavah) 05Stalled→03Open I don't think this is stalled on a... [21:01:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:04:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:05:29] (03Abandoned) 10Legoktm: Have PagedTiffHandler use Shellbox on Commons for 10% of requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724577 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [21:07:24] (03PS1) 10Legoktm: Enable paging on all Shellboxes [puppet] - 10https://gerrit.wikimedia.org/r/741724 [21:10:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:13:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:14:35] (03CR) 10Legoktm: [C: 03+2] Enable paging on all Shellboxes [puppet] - 10https://gerrit.wikimedia.org/r/741724 (owner: 10Legoktm) [21:18:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:21:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2061.codfw.wmnet with OS buster [21:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2061.codfw.wmnet with OS buster comp... [21:22:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:27:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:31:17] (03PS3) 10Legoktm: Improve docs on $wmgUseGlobalAbuseFilters and sort list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709560 [21:31:49] (03Abandoned) 10Legoktm: analytics: Migrate clean_jupyter_user_local_trash to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/708183 (https://phabricator.wikimedia.org/T286442) (owner: 10Legoktm) [21:31:53] (03Abandoned) 10Legoktm: analytics: Remove absented clean_jupyter_user_local_trash cron [puppet] - 10https://gerrit.wikimedia.org/r/708184 (https://phabricator.wikimedia.org/T273673) (owner: 10Legoktm) [21:33:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [21:33:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2062.codfw.wmnet with OS buster [21:33:35] (03CR) 10Legoktm: [C: 03+2] Improve docs on $wmgUseGlobalAbuseFilters and sort list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709560 (owner: 10Legoktm) [21:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2062.codfw.wmnet with OS buster [21:33:53] (03CR) 10Legoktm: [V: 
03+2 C: 03+2] Add tox.ini for CI [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/708031 (owner: 10Legoktm) [21:34:20] (03Merged) 10jenkins-bot: Improve docs on $wmgUseGlobalAbuseFilters and sort list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709560 (owner: 10Legoktm) [21:35:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:35:46] !log legoktm@deploy1002 Synchronized wmf-config/: Improve docs on $wmgUseGlobalAbuseFilters and sort list of wikis (duration: 00m 57s) [21:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:40:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:44:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [21:48:02] (03PS1) 10Ebernhardson: Revert "Add repository-swift plugin" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/741734 (https://phabricator.wikimedia.org/T295705) [21:53:31] (03Abandoned) 10Legoktm: exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [21:53:40] (03Abandoned) 10Legoktm: exim: Clean up remnants of legacy_mailing_lists [puppet] - 10https://gerrit.wikimedia.org/r/681724 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm) [21:53:49] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Figure out if we can remove legacy domain support for mailing lists - https://phabricator.wikimedia.org/T280472 (10Legoktm) Comments from Gerrit: @herron said: > I'm in favor of removing this, but still see a fair amount of legacy list mail in the exim l... [21:54:10] (03PS2) 10Ebernhardson: Revert "Add repository-swift plugin" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/741734 (https://phabricator.wikimedia.org/T295705) [21:54:24] (03Abandoned) 10Legoktm: systemd: Ensure units are unmasked [puppet] - 10https://gerrit.wikimedia.org/r/701171 (https://phabricator.wikimedia.org/T285425) (owner: 10Legoktm) [21:54:50] (03CR) 10Dzahn: [C: 03+1] "There might be some more details but I think it's ok if you just merge and go ahead and test again. 
Incremental changes are good and it's " [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth) [21:57:19] (03Abandoned) 10Legoktm: httpd: Add directory for applications to add config [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/691287 (owner: 10Legoktm) [21:58:53] (03Abandoned) 10Legoktm: docker: Stop copying config for each Debian version [puppet] - 10https://gerrit.wikimedia.org/r/683979 (owner: 10Legoktm) [22:03:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2062.codfw.wmnet with OS buster [22:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2062.codfw.wmnet with OS buster comp... [22:06:09] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [22:06:58] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) 05Open→03Stalled It's stalled on bandwith of releng a... 
[22:08:20] (03PS2) 10Legoktm: scap: Port mwgrep to Python 3 and other cleanup [puppet] - 10https://gerrit.wikimedia.org/r/565800 [22:08:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2063.codfw.wmnet with OS buster [22:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:36] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2063.codfw.wmnet with OS buster [22:09:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [22:10:16] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/32620/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [22:10:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:12:06] (03CR) 10Legoktm: [C: 03+2] scap: Port mwgrep to Python 3 and other cleanup [puppet] - 10https://gerrit.wikimedia.org/r/565800 (owner: 10Legoktm) [22:13:26] (03CR) 10Dzahn: "hosts have been added to /etc/dsh/group/ci-docroot on deploy1002 (by puppet)" [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [22:13:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:17:52] (03PS1) 10Legoktm: Revert "scap: Port mwgrep to Python 3 and other cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/741138 [22:18:03] (03CR) 10Legoktm: [C: 03+2] Revert "scap: Port mwgrep to Python 3 and other cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/741138 (owner: 10Legoktm) [22:18:21] lol [22:19:52] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/32621/" [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [22:21:41] (03CR) 10Dzahn: [V: 03+1 C: 03+2] P::doc: use correct php_fpm path [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [22:23:30] (03CR) 10Dzahn: "[doc1002:~] $ file /run/php/php7.3-fpm.sock" [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [22:25:03] (03CR) 10Dzahn: "deployed first on new machines, now on prod machine" [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [22:26:26] (03CR) 10Dzahn: "restarted apache on doc1001 - no issues" [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [22:29:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:31:04] (03CR) 10Dzahn: "since this is using rsync::server::module directly and not rsync::quickdatacopy I think the firewall holes via ferm are not included and w" [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [22:33:59] RECOVERY - 
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:36:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts gitlab-runner1001.wikimedia.org [22:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:32] !log running decom cookbook on gitlab-runner1001.wikimedia.org VM which was in state "ADMIN_down" and not used yet. to make room to recreate it as gitlab-runner1001.eqiad.wmnet T295481 [22:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:37] T295481: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481 [22:39:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2063.codfw.wmnet with OS buster [22:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2063.codfw.wmnet with OS buster comp... 
[22:39:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:40:22] Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox [22:41:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2064.codfw.wmnet with OS buster [22:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2064.codfw.wmnet with OS buster [22:43:38] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab-runner1001.wikimedia.org [22:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:51] (03CR) 10Cwhite: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/32623/" [puppet] - 10https://gerrit.wikimedia.org/r/741707 (https://phabricator.wikimedia.org/T294581) (owner: 10Cwhite) [22:44:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:50:04] !log Creating a new Ganeti VM and wondering which row to put it? 
[ganeti1009:~] $ for row in A B C D; do echo "row ${row}: $(sudo gnt-instance list -o name -F "pnode.group == 'row_${row}'" | wc -l) VMs"; done [22:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:39] (03PS3) 10Cwhite: profile: logstash: add production logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/732438 (https://phabricator.wikimedia.org/T288618) [22:52:36] (03CR) 10Legoktm: [C: 03+2] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/739351 (https://phabricator.wikimedia.org/T295805) (owner: 10Herron) [22:52:38] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host gitlab-runner1001.eqiad.wmnet [22:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:56] (03CR) 10Cwhite: [C: 03+2] profile: logstash: add production logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/732438 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [22:55:03] (03CR) 10Legoktm: P::doc: sync data to non-active servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [22:55:19] (03CR) 10Legoktm: [C: 03+1] "LGTM, probably worth waiting until Monday though for the removal." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741115 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah) [22:58:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [22:59:20] blazegraph firing because it's burning [22:59:26] we might have to restart that [22:59:50] ryankemper: is this a data-reload in progress per https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly ? [23:01:23] papaul: install of wdqs ongoing? 
[23:02:06] (03CR) 10Legoktm: [C: 03+1] Add ownership annotations for more Service SRE services [puppet] - 10https://gerrit.wikimedia.org/r/738426 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [23:03:51] !log wcqs1001 - sudo systemctl restart wcqs-blazegraph - after <+jinxer-wm> (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators [23:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:47] (03PS1) 10PipelineBot: apple-search: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/741739 [23:08:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab-runner1001.eqiad.wmnet [23:08:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://alerts.wikimedia.org [23:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:27] !log mwmaint1002 - sudo /usr/bin/find /var/lib/puppet/clientbucket/ -type f -size 1M -delete - to fix Icinga alert about large files in client bucket [23:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2064.codfw.wmnet with OS buster [23:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2064.codfw.wmnet with OS buster comp... 
[23:12:48] (03Abandoned) 10Legoktm: apple-search: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/741739 (owner: 10PipelineBot) [23:15:00] (03CR) 10Dzahn: site and install_server: add gitlab-runner1001 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [23:15:25] (03PS3) 10Dzahn: site and install_server: add gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [23:18:44] (03CR) 10Dzahn: [C: 03+2] site and install_server: add gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [23:18:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:18:49] (03PS4) 10Dzahn: site and install_server: add gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [23:22:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2065.codfw.wmnet with OS buster [23:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2065.codfw.wmnet with OS buster [23:26:02] !log ganeti - bringing up new VM - sudo gnt-instance start gitlab-runner1001.eqiad.wmnet ; ran puppet on install1003; installing OS T295481 [23:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:07] T295481: Setup GitLab Runner in trusted 
environment - https://phabricator.wikimedia.org/T295481 [23:28:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH) [23:32:42] (03PS2) 10Dzahn: site: use gitlab_runner role on gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740691 (https://phabricator.wikimedia.org/T295481) [23:34:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:43:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:44:19] !log puppetmaster1001:~] $ sudo puppet cert sign gitlab-runner1001.eqiad.wmnet | sudo install_console gitlab-runner1001.eqiad.wmnet (T295481) [23:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:23] T295481: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481 [23:44:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:52:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2065.codfw.wmnet with OS buster [23:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2065.codfw.wmnet with OS buster comp... [23:52:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:53:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:57:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH) [23:58:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH) [23:58:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:58:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH) [23:59:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2066.codfw.wmnet with OS buster [23:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2066.codfw.wmnet with OS buster