[00:00:04] <jouncebot>	 RoanKattouw and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T0000).
[00:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[00:18:07] <icinga-wm>	 PROBLEM - dump of s1 in codfw on alert1001 is CRITICAL: dump for s1 at codfw taken more than 8 days ago: Most recent backup 2021-11-16 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:25:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2012.codfw.wmnet with OS buster
[00:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:59] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2012.codfw.wmnet with OS buster completed:...
[00:28:00] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (10Papaul)
[00:28:23] <icinga-wm>	 PROBLEM - dump of s1 in eqiad on alert1001 is CRITICAL: dump for s1 at eqiad taken more than 8 days ago: Most recent backup 2021-11-16 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:34:12] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Wikidata, and 2 others: Q2:(Need By: End of Q2) rack/setup/install wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T294297 (10Papaul) 05Open→03Resolved  complete
[00:34:26] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (10Papaul) p:05Triage→03Medium
[04:00:37] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 91.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:14:07] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:20:19] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:20:45] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2010 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:43:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[04:45:25] <icinga-wm>	 PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:46:13] <icinga-wm>	 RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[04:53:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[04:57:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[04:59:37] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:08:23] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 91.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:12:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[05:17:07] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:34:30] <wikibugs>	 (03PS1) 10Krinkle: alertmanager: Update address for perf-team alerts [puppet] - 10https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368)
[05:34:52] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] alertmanager: Update address for perf-team alerts [puppet] - 10https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: 10Krinkle)
[05:37:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[05:37:08] <wikibugs>	 (03PS2) 10Krinkle: alertmanager: Update address for perf-team alerts [puppet] - 10https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368)
[05:47:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: After optimize table (T296143)', diff saved to https://phabricator.wikimedia.org/P17804 and previous config saved to /var/cache/conftool/dbconfig/20211124-054718-root.json
[05:47:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:23] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[05:48:39] <icinga-wm>	 PROBLEM - dump of m1 in codfw on alert1001 is CRITICAL: dump for m1 at codfw taken more than 8 days ago: Most recent backup 2021-11-16 05:19:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:58:07] <wikibugs>	 (03PS2) 10Marostegui: dbproxy10{17,21}: Change m5 standby host [puppet] - 10https://gerrit.wikimedia.org/r/740839 (https://phabricator.wikimedia.org/T288720)
[05:58:51] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] dbproxy10{17,21}: Change m5 standby host [puppet] - 10https://gerrit.wikimedia.org/r/740839 (https://phabricator.wikimedia.org/T288720) (owner: 10Marostegui)
[06:00:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[06:02:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: After optimize table (T296143)', diff saved to https://phabricator.wikimedia.org/P17805 and previous config saved to /var/cache/conftool/dbconfig/20211124-060221-root.json
[06:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:26] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[06:03:12] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Restore TTL back to 5M for m5-master [dns] - 10https://gerrit.wikimedia.org/r/740964 (https://phabricator.wikimedia.org/T288720)
[06:04:08] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Restore TTL back to 5M for m5-master [dns] - 10https://gerrit.wikimedia.org/r/740964 (https://phabricator.wikimedia.org/T288720) (owner: 10Marostegui)
[06:05:54] <marostegui>	 !log Upgrade db1128's kernel T288720
[06:05:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:58] <stashbot>	 T288720: Failover m5 master (db1128) to db1132 to upgrade its kernel - https://phabricator.wikimedia.org/T288720
[06:17:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: After optimize table (T296143)', diff saved to https://phabricator.wikimedia.org/P17806 and previous config saved to /var/cache/conftool/dbconfig/20211124-061725-root.json
[06:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:30] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[06:19:38] <wikibugs>	 10Puppet, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, 10Patch-For-Review: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (10Marostegui) p:05Triage→03Medium
[06:19:49] <wikibugs>	 10SRE, 10DBA, 10Platform Engineering, 10Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (10Marostegui) p:05Triage→03Medium
[06:28:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1065-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[06:32:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: After optimize table (T296143)', diff saved to https://phabricator.wikimedia.org/P17807 and previous config saved to /var/cache/conftool/dbconfig/20211124-063228-root.json
[06:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:34] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[06:38:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1065-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[06:45:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance (T296143)
[06:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance (T296143)
[06:45:07] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[06:45:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:17] <Amir1>	 !log running optimize table with replication on db1155:3314 (T296143)
[06:47:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:42] <Amir1>	 Since I just started a schema change, I'll go afk for a while
[07:06:03] <wikibugs>	 (03PS1) 10Marostegui: db1128: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/740967 (https://phabricator.wikimedia.org/T295965)
[07:07:25] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1128: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/740967 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui)
[07:12:01] <elukey>	 !log drop /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8 and other blockmgr-* dirs on stat1006 to free space on the root partition
[07:12:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:59] <icinga-wm>	 ACKNOWLEDGEMENT - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T296300
[07:18:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (10elukey) The host went down again, I acked the alert and didn't reboot it :)
[07:22:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[07:23:09] <elukey>	 !log reboot kubernetes1018 (role::insetup) to verify negotiated speed of eth interface
[07:23:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[07:28:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32594/console" [puppet] - 10https://gerrit.wikimedia.org/r/740763 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto)
[07:29:24] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: hieradata: Route search.wm.o to apple-search [puppet] - 10https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah)
[07:29:58] <elukey>	 kubernetes1018 seems not to be coming up from the reboot, nice
[07:30:16] <elukey>	 ah no it was only super slow, let's see
[07:30:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32595/console" [puppet] - 10https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah)
[07:34:07] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 04-1] "We don't need to add search.wm.org to the alternate domains." [puppet] - 10https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah)
[07:40:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1125.eqiad.wmnet with OS bullseye
[07:40:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:22] <wikibugs>	 (03PS3) 10Majavah: hieradata: Route search.wm.o to apple-search [puppet] - 10https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224)
[08:04:38] <wikibugs>	 (03CR) 10Majavah: hieradata: Route search.wm.o to apple-search (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah)
[08:05:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1125.eqiad.wmnet with OS bullseye
[08:05:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:37] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] trafficserver: rule for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/740763 (https://phabricator.wikimedia.org/T289224) (owner: 10Giuseppe Lavagetto)
[08:14:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[08:15:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org
[08:16:44] <wikibugs>	 (03PS1) 10Majavah: Remove search.wikimedia.org from appservers [puppet] - 10https://gerrit.wikimedia.org/r/741079 (https://phabricator.wikimedia.org/T289224)
[08:17:35] <_joe_>	 majavah: hold your horses :D
[08:18:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[08:22:45] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:24:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Extend access for bumeh-ctr [puppet] - 10https://gerrit.wikimedia.org/r/741080
[08:25:28] <wikibugs>	 (03PS2) 10Muehlenhoff: Extend access for bumeh-ctr [puppet] - 10https://gerrit.wikimedia.org/r/741080
[08:25:55] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:00] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] admin: Add samwilson to analytic privatedata group [puppet] - 10https://gerrit.wikimedia.org/r/740826 (https://phabricator.wikimedia.org/T296161) (owner: 10MMandere)
[08:27:25] <wikibugs>	 (03PS3) 10Muehlenhoff: Extend access for bumeh-ctr [puppet] - 10https://gerrit.wikimedia.org/r/741080
[08:30:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Extend access for bumeh-ctr [puppet] - 10https://gerrit.wikimedia.org/r/741080 (owner: 10Muehlenhoff)
[08:31:52] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Community-Tech, 10Patch-For-Review: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10MMandere) 05Open→03Resolved a:03MMandere @Samwilson you now should be able to access the private data. Please let us know if you face any ch...
[08:33:09] <majavah>	 _joe_: it's broken :( <div class="footer"><p>If you report this error to the Wikimedia System Administrators, please include the details below.</p><p class='text-muted'><code>Request from (my address) via cp2027.codfw.wmnet, ATS/8.0.8<br>Error: 502, connect failed at 2021-11-24 08:32:17 GMT</code></p></div>
[08:34:14] <_joe_>	 majavah: uh what do you mean?
[08:34:37] <_joe_>	 how did you even end up in codfw
[08:34:55] <_joe_>	 that is, indeed, the only server that should be pointing to apple search
[08:35:02] <_joe_>	 but it doesn't in my tests
[08:35:03] <majavah>	 testing from a VPS
[08:35:29] <_joe_>	 connect failed, is also quite peculiar
[08:36:02] <majavah>	 I can also reproduce locally, "curl -k -H "Host: search.wikimedia.org" https://text-lb.codfw.wikimedia.org/huoh" gives a 502 with that
[08:37:04] <_joe_>	 oh wait
[08:37:08] <_joe_>	 that's not a valid request
[08:37:36] <majavah>	 it's just something to bypass caching, but it probably should not give a 502
[08:37:43] <_joe_>	 uhm
[08:37:46] <_joe_>	 yeah that's strange
[08:37:51] <_joe_>	 very strange
[08:38:00] <_joe_>	 btw
[08:38:06] <_joe_>	 I just ran on a single backend
[08:38:17] <_joe_>	 so I don't get why all requests seem to be funneled through it
[08:38:54] <_joe_>	 could not connect [CONNECTION_ERROR] to 10.2.1.68 for 'https://apple-search.discovery.wmnet:4013/?search=test'
[08:39:03] <_joe_>	 well if I use curl from the same server
[08:39:04] <_joe_>	 it works
[08:39:59] <_joe_>	 so, no idea what's wrong there
[08:40:55] <wikibugs>	 (03PS1) 10Ladsgroup: Set actor migration to write both on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741082 (https://phabricator.wikimedia.org/T275246)
[08:41:35] <vgutierrez>	 !log depool cp2027
[08:41:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:30] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Community-Tech, 10Patch-For-Review: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10Samwilson) Thanks!  I'm still getting an error when I try to view FROM `event.visualeditorfeatureuse`:  > Permission denied: user=samwilson, acces...
[08:48:04] <wikibugs>	 (03PS1) 10Vgutierrez: cp5006: Disable PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005)
[08:49:01] <Amir1>	 jouncebot: nowandnext
[08:49:01] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 10 minute(s)
[08:49:01] <jouncebot>	 In 3 hour(s) and 10 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1200)
[08:49:05] <Amir1>	 cool
[08:49:09] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Set actor migration to write both on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741082 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup)
[08:49:53] <wikibugs>	 (03Merged) 10jenkins-bot: Set actor migration to write both on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741082 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup)
[08:51:28] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[08:51:29] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:741082|Set actor migration to write both on all wikis (T275246)]] (duration: 00m 57s)
[08:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:34] <stashbot>	 T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[08:52:30] <jelto>	 Just a short reminder: we will start re-deploying services in the codfw Kubernetes cluster soon. Feel free to ping me any time.
[08:53:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[08:54:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Cumin alias for wcqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/741084
[08:55:17] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[08:55:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:55:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:03] <_joe_>	 !log repooling cp2027
[08:56:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[08:59:07] <_joe_>	 majavah: fixed
[08:59:26] <majavah>	 yeah, and I see my curls in logstash
[08:59:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:59:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org
[09:01:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM deneb.codfw.wmnet
[09:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[09:04:31] <jelto>	 !log start re-deploy procedure in codfw Kubernetes T251305
[09:04:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:35] <stashbot>	 T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305
[09:07:40] <_joe_>	 jelto: if you're depooling all services, remember apple-search :P
[09:08:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[09:08:21] <_joe_>	 !log switching search.wikimedia.org to be served by the apple-search service
[09:08:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:51] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:09:47] <jelto>	 joe: I added apple-search to the list recently ;)
[09:10:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM deneb.codfw.wmnet
[09:10:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:57] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:11:03] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on apertium.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:11:04] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on apertium.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:06] <stashbot>	 T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305
[09:11:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[09:12:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: Update address for perf-team alerts [puppet] - 10https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: 10Krinkle)
[09:12:43] <icinga-wm>	 PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "I'm not familiar with all the different bits e.g. if they require a restart but can merge the patch" [puppet] - 10https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: 10Krinkle)
[09:13:14] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on api-gateway.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:15] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on api-gateway.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:19] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on apple-search.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:20] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on apple-search.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:25] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on blubberoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:26] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on blubberoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:30] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on citoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:32] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on citoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:35] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on cxserver.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:36] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cxserver.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:40] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on echostore.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:41] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on echostore.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:44] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventgate-analytics.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:46] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventgate-analytics.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:48] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventgate-analytics-external.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:50] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventgate-analytics-external.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:53] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventgate-logging-external.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:55] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventgate-logging-external.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:58] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventgate-main.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:59] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventgate-main.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:02] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventstreams.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:04] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventstreams.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:06] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on eventstreams-internal.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:08] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on eventstreams-internal.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:11] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on linkrecommendation.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:12] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on linkrecommendation.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:15] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mathoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:16] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mathoid.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:20] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mobileapps.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:21] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mobileapps.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:24] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on proton.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:26] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on proton.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:28] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on push-notifications.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:30] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on push-notifications.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:33] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on recommendation-api.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:35] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on recommendation-api.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:38] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on sessionstore.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:40] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on sessionstore.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:42] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on shellbox.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:44] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on shellbox.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:47] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on shellbox-constraints.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:48] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on shellbox-constraints.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:51] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on shellbox-media.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:53] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on shellbox-media.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:55] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on shellbox-syntaxhighlight.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:57] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on shellbox-syntaxhighlight.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:00] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on shellbox-timeline.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:02] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on shellbox-timeline.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:04] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on similar-users.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:05] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on similar-users.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:08] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on tegola-vector-tiles.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:10] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on tegola-vector-tiles.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:13] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on termbox.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:14] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on termbox.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:17] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on wikifeeds.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:18] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wikifeeds.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:21] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on zotero.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:22] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on zotero.svc.codfw.wmnet with reason: helm3 de-deploy T251305
[09:15:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:00] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet
[09:16:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:07] <stashbot>	 T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305
[09:16:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:44] <volans>	 jelto: pro-tip you can use multiple names for a single call to the downtime cookbook
[09:17:19] <volans>	 ;)
[09:17:44] <jelto>	 volans: thanks, I will try that next time :) sorry for the spam
[09:18:23] <elukey>	 marostegui: cumin cumin
[09:18:26] * elukey runs away
[09:18:29] <volans>	 :-P
[09:18:53] <marostegui>	 haha
[09:19:41] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet
[09:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM failoid2002.codfw.wmnet
[09:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:21] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:10] <_joe_>	 elukey: please put the cumin you bought in the spicerack, near the nextbox. Thanks.
[09:22:14] <_joe_>	 *netbox
[09:22:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM failoid2002.codfw.wmnet
[09:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:44] <wikibugs>	 (PS2) Vgutierrez: cp5006: Disable PROXY protocol [puppet] - https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005)
[09:24:36] <logmsgbot>	 !log jelto@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=(apertium|api-gateway|blubberoid|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventstreams|eventstreams-internal|linkrecommendation|mathoid|mobileapps|proton|push-notifications|recommendation-api|sessionstore|shellbox|shellbox-constraints|shellbox-media|shellbox-syntaxhighlight|shellbox-timeline|similar-users|tegola-vector-tiles|termbox|wikifeeds|zotero)
[09:24:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:03] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32597/console" [puppet] - https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:26:30] <_joe_>	 jelto: I don't see apple-search there
[09:27:25] <jelto>	 apple-search is not pooled in codfw currently, so I did not touch apple-search in confctl:
[09:27:26] <jelto>	 {"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=apple-search"}
[09:28:47] <jayme>	 it's not pooled in eqiad either, _joe_
[09:28:58] <_joe_>	 oh right
[09:29:11] <_joe_>	 well if it's depooled on both sides, it results as pooled in both
[09:29:15] <_joe_>	 so let me pool eqiad
[09:29:22] <jayme>	 oh...TIL
[09:30:03] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=apple-search,name=eqiad
[09:30:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM planet2002.codfw.wmnet
[09:30:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:26] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (MMandere)
[09:31:45] <wikibugs>	 (PS3) Vgutierrez: cp5006: Disable PROXY protocol [puppet] - https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005)
[09:32:42] <jayme>	 _joe_: so IIUC there is a fallback in pybal so that if both DCs are pooled=false it treats both of them as if they were pooled?
[09:32:57] <_joe_>	 jayme: pybal has nothing to do with this
[09:33:00] <_joe_>	 it's the dns
[09:33:07] <_joe_>	 for a/a services, it does
[09:33:08] <jayme>	 gdns, sorry
[09:33:16] <_joe_>	 for a/p services, it sends you to failoid IIRC
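The fallback _joe_ describes above (09:29 to 09:33) can be sketched as follows. This is an illustration only, not the actual DNS discovery or conftool logic: the function name, the `FAILOID` label, and the dictionary shape are hypothetical, and the active/passive branch reflects what _joe_ recalls ("IIRC") rather than confirmed behaviour.

```python
from typing import Dict, List

# Hypothetical label for the fail-fast target mentioned for a/p services.
FAILOID = "failoid"


def effective_targets(pooled: Dict[str, bool], active_active: bool) -> List[str]:
    """Return the sites a discovery record would resolve to (sketch only).

    pooled maps a datacenter name to its dnsdisc pooled flag.
    """
    # Normal case: serve from whichever sites are actually pooled.
    live = sorted(dc for dc, is_pooled in pooled.items() if is_pooled)
    if live:
        return live
    # Everything depooled: an a/a service behaves as if all sites were pooled.
    if active_active:
        return sorted(pooled)
    # Everything depooled: an a/p service is pointed at failoid (per "IIRC").
    return [FAILOID]


# apple-search before the fix: depooled on both sides.
print(effective_targets({"eqiad": False, "codfw": False}, active_active=True))
# apple-search after pooling eqiad, so codfw can be worked on safely.
print(effective_targets({"eqiad": True, "codfw": False}, active_active=True))
```

Under these assumptions, the first call still returns both sites, which is why eqiad had to be pooled explicitly before redeploying apple-search in codfw.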
[09:34:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM planet2002.codfw.wmnet
[09:34:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:23] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (MMandere) @nskaggs please help approving Taavi's request.
[09:35:07] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32598/console" [puppet] - https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:37:30] <James_F>	 Any SREer around to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/714068 for me (Beta Cluster change)?
[09:40:20] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (MoritzMuehlenhoff)
[09:41:09] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (Urbanecm) This has my support. Majavah is very helpful, and this level of access would definitely let them to be even more helpful :-).
[09:41:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM mx2001.wikimedia.org
[09:41:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet
[09:43:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:52] <wikibugs>	 (PS1) Elukey: WIP - kserve-inference: add support for local tls proxy [deployment-charts] - https://gerrit.wikimedia.org/r/741092
[09:45:19] <vgutierrez>	 !log depool cp5006 - T290005
[09:45:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:23] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:45:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM mx2001.wikimedia.org
[09:45:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install2003.wikimedia.org
[09:46:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:52] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet
[09:46:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:41] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] cp5006: Disable PROXY protocol [puppet] - https://gerrit.wikimedia.org/r/741083 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:48:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[09:49:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet
[09:49:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install2003.wikimedia.org
[09:50:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:39] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' .
[09:52:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:14] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet
[09:53:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[09:53:42] <vgutierrez>	 !log restart varnish/haproxy on cp5006 - T290005
[09:53:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet
[09:53:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:45] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:53:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:25] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[09:54:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:51] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (MoritzMuehlenhoff)
[09:55:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM debmonitor2002.codfw.wmnet
[09:55:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:21] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet
[09:56:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:35] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[09:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM debmonitor2002.codfw.wmnet
[09:56:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:29] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - apple-search_4013: Servers kubernetes2010.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:58:51] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[09:58:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:03] <jayme>	 ah, great - this is you jelto ^
[09:59:16] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "+1 for the Pontoon bits, thank you Majavah" [puppet] - https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: Majavah)
[09:59:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM puppetboard2001.codfw.wmnet
[09:59:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet
[10:00:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:37] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:01:04] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (elukey)
[10:01:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetboard2001.codfw.wmnet
[10:01:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:04] <vgutierrez>	 !log repool cp5006 - T290005
[10:02:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:08] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:02:38] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet
[10:02:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM puppetboard2002.codfw.wmnet
[10:02:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:59] <wikibugs>	 SRE, Commons, DBA, MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (Ladsgroup) First of all, Sorry it took me so long to comment. Vacation, onboarding, etc.  I was involved in the work of collapsing a...
[10:04:02] <wikibugs>	 (PS1) Inductiveload: enwikisource: enable anonymous talk page mobile tabs [mediawiki-config] - https://gerrit.wikimedia.org/r/741097 (https://phabricator.wikimedia.org/T54165)
[10:06:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetboard2002.codfw.wmnet
[10:06:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:33] <wikibugs>	 SRE, vm-requests, Patch-For-Review, cloud-services-team (Kanban): eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (aborrero) Open→Resolved
[10:06:42] <wikibugs>	 (PS2) Inductiveload: enwikisource: enable anonymous talk page mobile tabs [mediawiki-config] - https://gerrit.wikimedia.org/r/741097 (https://phabricator.wikimedia.org/T47955)
[10:06:55] <jelto>	 !log downtime PyBal backends health check for helm3 de-deploy T251305. I'm keeping an eye on Icinga and will remove the downtime as soon as I'm finished
[10:06:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:58] <stashbot>	 T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305
[10:07:44] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (MoritzMuehlenhoff)
[10:08:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[10:10:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[10:12:01] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[10:12:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:43] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[10:13:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:02] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[10:14:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:09] <wikibugs>	 SRE, ops-eqiad, serviceops: Kubernetes1018's eth negotiated speed is 10MB/s - https://phabricator.wikimedia.org/T296369 (ayounsi) That looks like a faulty cable or interface, over to DCops for troubleshooting, let us know if you need Netops help.
[10:17:56] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[10:17:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:54] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[10:18:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:29] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:51] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:45] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:21:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:10] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[10:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:15] <XioNoX>	 !log disable ping-offload for codfw - T294119
[10:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:18] <stashbot>	 T294119: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119
[10:25:28] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[10:25:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:59] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'production' .
[10:28:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:01] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_echostore_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:28:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ping2001.codfw.wmnet
[10:28:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:05] <jelto>	 ^ that's me, redeploying echostore
[10:30:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:30:43] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (MMandere) p:Triage→Medium
[10:32:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ping2001.codfw.wmnet
[10:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:44] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[10:33:44] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
[10:33:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:53] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: sre.hosts.upgrade-and-reboot: update reference to IcingaHost [cookbooks] - https://gerrit.wikimedia.org/r/741100
[10:36:07] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[10:36:07] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[10:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:10] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10MMandere) 05Resolved→03Open @Samwilson, checking I'll advise once done.
[10:36:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:11] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[10:38:11] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[10:38:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM people2002.codfw.wmnet
[10:38:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:06] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10Samwilson) 05Open→03Resolved @MMandere don't worry, it's working now! :-) thanks!
[10:40:15] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[10:40:15] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[10:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM people2002.codfw.wmnet
[10:42:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:10] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks." [cookbooks] - 10https://gerrit.wikimedia.org/r/741100 (owner: 10Arturo Borrero Gonzalez)
[10:42:18] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[10:42:18] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[10:42:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:07] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventstreams_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:44:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[10:44:29] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[10:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:31] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10MMandere) Great, you're welcome! Is there something else you did for it to  start working?
[10:46:43] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
[10:46:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:00] <wikibugs>	 (03CR) 10Volans: "Possible typos, not 100% sure." [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff)
[10:47:29] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:47:42] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
[10:47:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM xhgui2001.codfw.wmnet
[10:47:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:02] <XioNoX>	 !log rollback: disable ping-offload for codfw - T294119
[10:48:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:05] <stashbot>	 T294119: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119
[10:49:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM xhgui2001.codfw.wmnet
[10:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[10:50:06] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' .
[10:50:06] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[10:50:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:22] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sre.hosts.upgrade-and-reboot: update reference to IcingaHost [cookbooks] - 10https://gerrit.wikimedia.org/r/741100 (owner: 10Arturo Borrero Gonzalez)
[10:51:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM webperf2001.codfw.wmnet
[10:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:22] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' .
[10:52:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: profile::mediawiki::php: support kubernetes in php-fatal-error.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto)
[10:53:30] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.upgrade-and-reboot: update reference to IcingaHost [cookbooks] - 10https://gerrit.wikimedia.org/r/741100 (owner: 10Arturo Borrero Gonzalez)
[10:53:39] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: sre.hosts.upgrade-and-reboot: update reference to IcingaHost (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/741100 (owner: 10Arturo Borrero Gonzalez)
[10:53:52] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[10:53:53] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: profile::mediawiki::php: support kubernetes in php-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851)
[10:53:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:22] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10Samwilson) No, I don't think so. I did try logging out and in again, but the fix came some time after that.
[10:55:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM webperf2001.codfw.wmnet
[10:55:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:58] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10MMandere) @Samwilson understood :)
[11:01:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::php: support kubernetes in php-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto)
[11:02:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM webperf2002.codfw.wmnet
[11:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[11:03:14] <wikibugs>	 (03PS1) 10JMeybohm: Add read-only access for jayme [homer/public] - 10https://gerrit.wikimedia.org/r/741108
[11:05:27] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[11:05:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:06:29] <jelto>	 ^ that's me
[11:07:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM webperf2002.codfw.wmnet
[11:07:03] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (10MoritzMuehlenhoff) @Samwilson : It seems related to Puppet (our configuration management system) run times. Your update that it was still failing happened 21 minutes af...
[11:07:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:25] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:08:26] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:08:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:37] <_joe_>	 jelto: can I deploy mwdebug to codfw or should it wait?
[11:09:00] <jelto>	 joe: I'll deploy it in ~3 min if that works for you. It's next in the list
[11:09:20] <_joe_>	 sure, go ahead yourself then :)
[11:10:01] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Fix UDS check cmd [puppet] - 10https://gerrit.wikimedia.org/r/741109 (https://phabricator.wikimedia.org/T290005)
[11:10:27] <wikibugs>	 (03PS1) 10Jbond: wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110
[11:10:29] <wikibugs>	 (03PS1) 10Jbond: R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111
[11:10:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org
[11:11:11] <wikibugs>	 (03PS2) 10Jbond: R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111
[11:11:13] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] netbox - cas: allow users with active=False [software/netbox] - 10https://gerrit.wikimedia.org/r/739309 (https://phabricator.wikimedia.org/T295148) (owner: 10Volans)
[11:11:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110 (owner: 10Jbond)
[11:12:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111 (owner: 10Jbond)
[11:12:59] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32600/console" [puppet] - 10https://gerrit.wikimedia.org/r/741109 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[11:13:01] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111 (owner: 10Jbond)
[11:13:25] <_joe_>	 jelto: are you going to also recreate the pods?
[11:13:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[11:13:37] <icinga-wm>	 PROBLEM - LVS mwdebug codfw port 4444/tcp - mwdebug- mwdebug.svc.codfw.wmnet IPv4 on mwdebug.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.59 and port 4444: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[11:13:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter2003.codfw.wmnet
[11:13:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:49] <_joe_>	 I guess you are :D
[11:15:13] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:15:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:43] <icinga-wm>	 RECOVERY - LVS mwdebug codfw port 4444/tcp - mwdebug- mwdebug.svc.codfw.wmnet IPv4 on mwdebug.svc.codfw.wmnet is OK: OK - Certificate appservers-rw.discovery.wmnet will expire on Mon 06 Jul 2026 02:13:19 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[11:15:43] <wikibugs>	 (03PS3) 10Jbond: R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111
[11:16:15] <jelto>	 joe: yes, the process recreates pods. Sorry, I forgot to downtime mwdebug. I think we have the same situation as with apple-search here. Are you using the PyBal fallback so that it's pooled anyway? Then the service might not have been reachable for the last ~5 minutes
[11:17:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter2003.codfw.wmnet
[11:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111 (owner: 10Jbond)
[11:18:03] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
[11:18:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:55] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: aptrepo: add ceph packages in the octopus/bullseye combo. [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175)
[11:18:57] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' .
[11:18:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter2004.codfw.wmnet
[11:19:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org
[11:21:10] <wikibugs>	 (03PS4) 10Jbond: R:uwsgi::app: add support for checking via http [puppet] - 10https://gerrit.wikimedia.org/r/741111
[11:21:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Kormat) > Maybe we need to revisit the alerting for hosts if they start to send false alerts often.  @Ladsgroup: I'm not following, why would a networking problem be a 'false' alert?
[11:21:34] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[11:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:50] <wikibugs>	 (03PS5) 10Jbond: R:uwsgi::app:Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111
[11:22:06] <wikibugs>	 (03PS6) 10Jbond: R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111
[11:22:52] <jelto>	 _joe_: I'm a bit concerned that the LogstashKafkaConsumerLag alert could be related to my re-deploy. Is this something I should take a look at now?
[11:23:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[11:23:18] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[11:23:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:24] <_joe_>	 jelto: mostly to the messages being ingested by logstash I would say
[11:23:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter2004.codfw.wmnet
[11:23:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:39] <_joe_>	 godog: do you have better suggestions?
[11:23:53] <_joe_>	 re: understanding what's causing the surge in logging
[11:24:19] <wikibugs>	 (03PS1) 10Jbond: R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114
[11:24:25] <godog>	 _joe_ jelto taking a look
[11:25:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1141.eqiad.wmnet with reason: Maintenance T296143
[11:25:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1141.eqiad.wmnet with reason: Maintenance T296143
[11:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:14] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[11:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:18] <wikibugs>	 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe)
[11:25:20] <wikibugs>	 (03PS7) 10Jbond: R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111
[11:25:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T296143)', diff saved to https://phabricator.wikimedia.org/P17808 and previous config saved to /var/cache/conftool/dbconfig/20211124-112539-ladsgroup.json
[11:25:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:58] <wikibugs>	 (03PS2) 10Jbond: wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110
[11:26:18] <godog>	 oh yeah that's been active for a while heh
[11:26:21] <godog>	 that == the alert
[11:26:25] <wikibugs>	 (03PS2) 10Jbond: R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114
[11:26:40] <wikibugs>	 (03PS8) 10Jbond: R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111
[11:26:42] <wikibugs>	 (03PS1) 10Majavah: Remove search.wikimedia.org files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741115 (https://phabricator.wikimedia.org/T289224)
[11:26:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110 (owner: 10Jbond)
[11:27:29] <wikibugs>	 (03PS9) 10Jbond: R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111
[11:27:33] <godog>	 jelto: lag started at 6 UTC, was that also when you began your activities?
[11:27:49] <Amir1>	 !log optimizing image.commonswiki in db1141 (T296143)
[11:27:49] <jelto>	 godog: no I started at 9 UTC today
[11:27:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:54] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: aptrepo: add ceph packages in the octopus/bullseye combo [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175)
[11:28:10] <wikibugs>	 (03PS10) 10Jbond: R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111
[11:28:22] <godog>	 jelto: yeah it must be something else, I'm taking a look anyways though
[11:28:33] <jelto>	 godog: great thanks a lot
[11:29:07] <icinga-wm>	 RECOVERY - dump of m1 in codfw on alert1001 is OK: Last dump for m1 at codfw (db2078.codfw.wmnet:3321) taken on 2021-11-24 10:03:10 (31 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[11:29:14] <wikibugs>	 (03CR) 10David Caro: aptrepo: add ceph packages in the octopus/bullseye combo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez)
[11:30:58] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32602/console" [puppet] - 10https://gerrit.wikimedia.org/r/741111 (owner: 10Jbond)
[11:32:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[11:32:44] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[11:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:54] <wikibugs>	 (03PS1) 10Jbond: rubocop: exclude lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/741117
[11:33:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] rubocop: exclude lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/741117 (owner: 10Jbond)
[11:33:29] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] R:uwsgi::app: Add parameter types [puppet] - 10https://gerrit.wikimedia.org/r/741111 (owner: 10Jbond)
[11:33:47] <wikibugs>	 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) After deploying the changes to php-fatal-error.php, we can now see the error messages delivered by php-wmerrors in logstash.
[11:34:12] <wikibugs>	 (03PS3) 10Jbond: wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110
[11:34:21] <wikibugs>	 (03PS3) 10Jbond: R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114
[11:35:09] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
[11:35:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:25] <godog>	 !log bounce apache2 on logstash1025
[11:35:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:18] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
[11:36:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:36] <wikibugs>	 (03CR) 10Muehlenhoff: aptrepo: add ceph packages in the octopus/bullseye combo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez)
[11:37:50] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish: Fix UDS check cmd [puppet] - 10https://gerrit.wikimedia.org/r/741109 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[11:37:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM poolcounter2004.codfw.wmnet
[11:37:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:51] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
[11:38:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:40] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox' for release 'main' .
[11:40:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:13] <icinga-wm>	 RECOVERY - dump of s1 in codfw on alert1001 is OK: Last dump for s1 at codfw (db2141.codfw.wmnet:3311) taken on 2021-11-24 09:53:28 (162 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[11:41:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM poolcounter2004.codfw.wmnet
[11:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:55] <wikibugs>	 (03PS1) 10Jbond: P:puppetboard::ng: Add nrpe check_http command [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304)
[11:42:04] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: httpbb: move tests for search.wikimedia.org to apple-search [puppet] - 10https://gerrit.wikimedia.org/r/741119
[11:42:45] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' .
[11:42:45] <wikibugs>	 (03PS4) 10Jbond: wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110
[11:42:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM poolcounter2003.codfw.wmnet
[11:43:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:48] <wikibugs>	 (03PS4) 10Jbond: R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114
[11:44:24] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-media' for release 'main' .
[11:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:40] <wikibugs>	 (03PS2) 10Jbond: P:puppetboard::ng: Add nrpe check_http command [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304)
[11:45:09] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-media' for release 'main' .
[11:45:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:13] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32604/console" [puppet] - 10https://gerrit.wikimedia.org/r/741114 (owner: 10Jbond)
[11:45:57] <wikibugs>	 (03CR) 10Jbond: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/741117 (owner: 10Jbond)
[11:45:57] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-media' for release 'main' .
[11:45:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:32] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
[11:48:33] <wikibugs>	 (03PS3) 10Jbond: P:puppetboard::ng: Add nrpe check_http command [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304)
[11:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:22] <moritzm>	 !log systemctl reset-failed ifup@ens5.service on poolcounter2003 T273026
[11:49:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:25] <stashbot>	 T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[11:50:44] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
[11:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM poolcounter2003.codfw.wmnet
[11:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:36] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: aptrepo: add ceph packages in the octopus/bullseye combo [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175)
[11:52:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: aptrepo: add ceph packages in the octopus/bullseye combo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez)
[11:53:02] <wikibugs>	 (03PS5) 10Jbond: R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114
[11:53:12] <wikibugs>	 (03PS4) 10Jbond: P:puppetboard::ng: Add nrpe check_http command [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304)
[11:53:13] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[11:53:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez)
[11:54:18] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32607/console" [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304) (owner: 10Jbond)
[11:54:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[11:54:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM rpki2002.codfw.wmnet
[11:54:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:17] <wikibugs>	 10SRE-swift-storage: Media storage metadata inconsistent with Swift - https://phabricator.wikimedia.org/T289996 (10jcrespo)
[11:56:27] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[11:56:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:29] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] aptrepo: add ceph packages in the octopus/bullseye combo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez)
[11:58:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM rpki2002.codfw.wmnet
[11:58:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:12] <wikibugs>	 (03PS3) 10WMDE-Fisch: VisualEditor template dialog: new sidebar and inline descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: 10Awight)
[11:58:15] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' .
[11:58:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:58:49] <jelto>	 ^ that's maybe me, however I have to take a look at what routinator is
[11:59:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib::argpasre: parse args to command string [puppet] - 10https://gerrit.wikimedia.org/r/741110 (owner: 10Jbond)
[11:59:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] R:service::uwsgi: add support for nrpe check_http check [puppet] - 10https://gerrit.wikimedia.org/r/741114 (owner: 10Jbond)
[11:59:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netbox-dev2001.wikimedia.org
[11:59:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppetboard::ng: Add nrpe check_http command [puppet] - 10https://gerrit.wikimedia.org/r/741118 (https://phabricator.wikimedia.org/T296304) (owner: 10Jbond)
[11:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1200). Please do the needful.
[12:00:04] <jouncebot>	 awight: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:09] <awight>	 I can deploy my patches :-)
[12:00:12] <Lucas_WMDE>	 ok :)
[12:00:47] <majavah>	 jelto: it's not you, it's moritzm's restart of rpki2002
[12:00:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[12:01:04] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' .
[12:01:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:13] <jelto>	 majavah: great thanks!
[12:02:05] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: 10Awight)
[12:02:31] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:02:46] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[12:02:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netbox-dev2001.wikimedia.org
[12:02:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netbox2001.wikimedia.org
[12:03:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:20] <wikibugs>	 (03Merged) 10jenkins-bot: VisualEditor template dialog: new sidebar and inline descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740766 (https://phabricator.wikimedia.org/T284203) (owner: 10Awight)
[12:03:43] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' .
[12:03:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:19] <wikibugs>	 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Lucas_Werkmeister_WMDE) Would it be better on Commons if we set `$wgWBClientSettings['entityUsageModifierLimits']['C']` to 1 instead...
[12:07:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[12:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netbox2001.wikimedia.org
[12:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[12:09:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:51] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] analytics:refinery:job:refine_sanitize: Fix refine_monitor offsets [puppet] - 10https://gerrit.wikimedia.org/r/740931 (owner: 10Mforns)
[12:10:05] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config: Config: [[gerrit:740766|VisualEditor template dialog: new sidebar and inline descriptions (T284203, T286992)]] (duration: 00m 57s)
[12:10:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:13] <stashbot>	 T286992: Deploy VE template dialog improvements to small set of wikis - https://phabricator.wikimedia.org/T286992
[12:10:13] <stashbot>	 T284203: Deploy inline descriptions, extended sidebar and bigger dialog to small set of wikis - https://phabricator.wikimedia.org/T284203
[12:10:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netboxdb2001.codfw.wmnet
[12:10:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:15] <wikibugs>	 (03PS2) 10Awight: [lint] fully-qualify classname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737193
[12:12:21] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737193 (owner: 10Awight)
[12:12:45] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[12:13:09] <wikibugs>	 (03Merged) 10jenkins-bot: [lint] fully-qualify classname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737193 (owner: 10Awight)
[12:13:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netboxdb2001.codfw.wmnet
[12:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:50] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[12:15:18] <wikibugs>	 (03PS1) 10Jelto: helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/741124
[12:15:56] <wikibugs>	 (03PS2) 10Awight: Replace global with parent scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737195
[12:16:06] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737195 (owner: 10Awight)
[12:16:31] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:737193|[lint] fully-qualify classname]] (duration: 00m 55s)
[12:16:32] <wikibugs>	 (03PS1) 10Jbond: puppetboard - service: update puppetboard live check [puppet] - 10https://gerrit.wikimedia.org/r/741146
[12:16:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:46] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[12:16:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/741124 (owner: 10Jelto)
[12:16:59] <wikibugs>	 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on hywiki - https://phabricator.wikimedia.org/T296382 (10Lucas_Werkmeister_WMDE)
[12:17:06] <wikibugs>	 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on warwiki - https://phabricator.wikimedia.org/T296383 (10Lucas_Werkmeister_WMDE)
[12:17:16] <wikibugs>	 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on cebwiki - https://phabricator.wikimedia.org/T296384 (10Lucas_Werkmeister_WMDE)
[12:17:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32608/console" [puppet] - 10https://gerrit.wikimedia.org/r/741146 (owner: 10Jbond)
[12:18:04] <wikibugs>	 (03Merged) 10jenkins-bot: Replace global with parent scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737195 (owner: 10Awight)
[12:18:54] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add more alerts to the data-engineering team [alerts] - 10https://gerrit.wikimedia.org/r/735669 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis)
[12:19:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[12:19:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] puppetboard - service: update puppetboard live check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741146 (owner: 10Jbond)
[12:20:25] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:20:54] <wikibugs>	 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Ladsgroup) >>! In T188730#7526191, @Lucas_Werkmeister_WMDE wrote: > Would it be better on Commons if we set `$wgWBClientSettings['en...
[12:21:02] <wikibugs>	 (03Merged) 10jenkins-bot: Add more alerts to the data-engineering team [alerts] - 10https://gerrit.wikimedia.org/r/735669 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis)
[12:21:12] <wikibugs>	 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on hywiki - https://phabricator.wikimedia.org/T296382 (10Lucas_Werkmeister_WMDE) 2021-11-24: `lang=shell lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql hywiki <<< 'SELEC...
[12:21:41] <wikibugs>	 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on warwiki - https://phabricator.wikimedia.org/T296383 (10Lucas_Werkmeister_WMDE) 2021-11-24:  `lang=shell lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql warwiki <<< 'SE...
[12:21:45] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:737195|Replace global with parent scope]] (duration: 00m 55s)
[12:21:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM releases2002.codfw.wmnet
[12:21:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:00] <wikibugs>	 10SRE, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on cebwiki - https://phabricator.wikimedia.org/T296384 (10Lucas_Werkmeister_WMDE) 2021-11-24: `lang=shell lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql cebwiki <<< 'SEL...
[12:22:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - miscweb_4111: Servers kubernetes2016.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled: mwdebug_4444: Servers kubernetes2004.codfw.wmnet, k
[12:22:25] <icinga-wm>	 s2012.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:22:59] <jelto>	 ^ that's me, miscweb needs some extra care and the downtime was a bit short
[12:23:08] <awight>	 !log EU scap deployment finished
[12:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:23:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[12:23:52] <wikibugs>	 (03PS1) 10Jelto: miscweb: fix whitespace for affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/741149
[12:24:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM releases2002.codfw.wmnet
[12:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] miscweb: fix whitespace for affinity [deployment-charts] - 10https://gerrit.wikimedia.org/r/741149 (owner: 10Jelto)
[12:25:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM urldownloader2001.wikimedia.org
[12:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM urldownloader2001.wikimedia.org
[12:28:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:36] <wikibugs>	 (03CR) 10Btullis: superset: set webserver timeout to 180 seconds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740712 (https://phabricator.wikimedia.org/T294771) (owner: 10Razzi)
[12:29:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM urldownloader2002.wikimedia.org
[12:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:52] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[12:29:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:50] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "need to update the following reference:" [puppet] - 10https://gerrit.wikimedia.org/r/740903 (owner: 10Dzahn)
[12:32:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] public_cloud: Add public_clouds_shutdown to global config [puppet] - 10https://gerrit.wikimedia.org/r/740545 (owner: 10Jbond)
[12:32:18] <wikibugs>	 (03PS1) 10Jcrespo: argparams: Test edge cases [puppet] - 10https://gerrit.wikimedia.org/r/741152
[12:32:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM urldownloader2002.wikimedia.org
[12:32:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:51] <wikibugs>	 (03PS2) 10Jcrespo: argparse: Test edge cases [puppet] - 10https://gerrit.wikimedia.org/r/741152
[12:33:10] <wikibugs>	 (03Abandoned) 10Jbond: WIP: do not merge - CR to test varnish changes [puppet] - 10https://gerrit.wikimedia.org/r/740842 (owner: 10Jbond)
[12:33:15] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:24] <wikibugs>	 (03PS5) 10Jbond: R:varnish:instance: Add genral public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891)
[12:33:32] <wikibugs>	 (03PS3) 10Jcrespo: argparse: Test edge cases [puppet] - 10https://gerrit.wikimedia.org/r/741152
[12:33:34] <wikibugs>	 (03PS10) 10Jbond: R:varnish:instance: Add hiere key to control cloud ratelimits [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891)
[12:35:11] <wikibugs>	 (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[12:36:17] <wikibugs>	 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Ladsgroup) Two ideas for improving the current design:  - Normalize the table based on eu_aspect.   - While this would have been som...
[12:36:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[12:36:58] <wikibugs>	 (03PS1) 10MMandere: admin: Add user taavi to wmcs and labtest group [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192)
[12:37:12] <jbond>	 !log disable puppet for puppetdb reboot
[12:37:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:41] <icinga-wm>	 PROBLEM - LVS mwdebug codfw port 4444/tcp - mwdebug- mwdebug.svc.codfw.wmnet IPv4 on mwdebug.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.59 and port 4444: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:39:20] <wikibugs>	 10SRE, 10SRE-swift-storage, 10MediaWiki-extensions-Score: upload.wikimedia.org does not set content-encoding headers for Score-generated lilypond files - https://phabricator.wikimedia.org/T287326 (10TheDJ) 05Open→03Resolved a:03TheDJ
[12:41:49] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "+1, uid matches and so do the groups (shell access)." [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere)
[12:43:58] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM puppetdb2002.codfw.wmnet
[12:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1141 (T296143)', diff saved to https://phabricator.wikimedia.org/P17809 and previous config saved to /var/cache/conftool/dbconfig/20211124-124420-ladsgroup.json
[12:44:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:24] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[12:44:32] <wikibugs>	 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Lucas_Werkmeister_WMDE) > * Split the table to wbc_property_usage and wbc_item_usage and use numeric ids there. >   * I don't know w...
[12:45:53] <wikibugs>	 (03PS2) 10Muehlenhoff: Add Cumin alias for wcqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/741084
[12:46:08] <logmsgbot>	 !log jelto@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=(apertium|api-gateway|apple-search|blubberoid|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventstreams|eventstreams-internal|linkrecommendation|mathoid|mobileapps|proton|push-notifications|recommendation-api|sessionstore|shellbox|shellbox-constraints|shellbox-media|shellbox-syntaxhighlight|shellbox-timeline|similar-users|tegola-vector-tiles|termbox|wikifeeds|zotero)
[12:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:13] <wikibugs>	 (03CR) 10Muehlenhoff: Add Cumin alias for wcqs hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff)
[12:47:43] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM puppetdb2002.codfw.wmnet
[12:47:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:20] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[12:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:33] <jbond>	 !log enable puppet post puppetdb reboot
[12:48:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes2007.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:50:32] <wikibugs>	 10SRE, 10Commons, 10DBA, 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Ladsgroup) Yeah, then possibly split that column into two, one numeric id and numeric identifier of the entity type (item=0, propert...
[12:50:39] <icinga-wm>	 RECOVERY - LVS mwdebug codfw port 4444/tcp - mwdebug- mwdebug.svc.codfw.wmnet IPv4 on mwdebug.svc.codfw.wmnet is OK: OK - Certificate appservers-rw.discovery.wmnet will expire on Mon 06 Jul 2026 02:13:19 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:51:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:51:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM grafana2001.codfw.wmnet
[12:51:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:13] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] hiera::role::common::deployment_server update helmBinary codfw [puppet] - 10https://gerrit.wikimedia.org/r/736822 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[12:53:21] <wikibugs>	 (03PS2) 10Jelto: hiera::role::common::deployment_server update helmBinary codfw [puppet] - 10https://gerrit.wikimedia.org/r/736822 (https://phabricator.wikimedia.org/T251305)
[12:53:24] <wikibugs>	 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on cebwiki - https://phabricator.wikimedia.org/T296384 (10Marostegui)
[12:53:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:53:52] <wikibugs>	 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on warwiki - https://phabricator.wikimedia.org/T296383 (10Marostegui)
[12:53:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM grafana2001.codfw.wmnet
[12:54:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:38] <wikibugs>	 (03PS1) 10Klausman: site: Move non-vm ML machines in codfw to setup for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/741154 (https://phabricator.wikimedia.org/T294412)
[12:54:40] <wikibugs>	 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10wdwb-tech: Enable statement usage tracking on hywiki - https://phabricator.wikimedia.org/T296382 (10Marostegui)
[12:54:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM apt2001.wikimedia.org
[12:54:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:24] <wikibugs>	 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 3 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Marostegui)
[12:58:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[12:59:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: db1131 alerting due to network hiccup - https://phabricator.wikimedia.org/T295952 (10Ladsgroup) >>! In T295952#7526103, @Kormat wrote: >> Maybe we need to revisit the alerting for hosts if they start to send false alerts often. >  > @Ladsgroup: I'm not following, why would a networ...
[13:00:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM apt2001.wikimedia.org
[13:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1141 (T296143)', diff saved to https://phabricator.wikimedia.org/P17810 and previous config saved to /var/cache/conftool/dbconfig/20211124-130200-ladsgroup.json
[13:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:04] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[13:04:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:37] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[13:07:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:23] <wikibugs>	 (03PS2) 10Muehlenhoff: Point irc.wikimedia.org to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/740864 (https://phabricator.wikimedia.org/T294119)
[13:07:27] <icinga-wm>	 PROBLEM - SSH on kubernetes1003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:10:50] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff)
[13:12:53] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Testing-Roadblocks, 10User-jbond: Allow using WMCS hiera lookup order in Puppet rspec tests - https://phabricator.wikimedia.org/T296327 (10jbond) @Majavah I have had a think about this and i don't think that it will work as expected.  Currently the shared spec help...
[13:13:28] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] Add Cumin alias for wcqs hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff)
[13:15:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1141 (T296143)', diff saved to https://phabricator.wikimedia.org/P17811 and previous config saved to /var/cache/conftool/dbconfig/20211124-131519-ladsgroup.json
[13:15:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:24] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[13:15:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/740913 (owner: 10Majavah)
[13:17:50] <wikibugs>	 (03CR) 10Muehlenhoff: Add Cumin alias for wcqs hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff)
[13:19:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[13:22:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-corp2001.wikimedia.org
[13:22:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:34] <icinga-wm>	 RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[13:25:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-corp2001.wikimedia.org
[13:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Point irc.wikimedia.org to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/740864 (https://phabricator.wikimedia.org/T294119) (owner: 10Muehlenhoff)
[13:27:27] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2004.codfw.wmnet
[13:27:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:29] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM logstash2004.codfw.wmnet
[13:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:51] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2004.codfw.wmnet
[13:27:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica2005.wikimedia.org
[13:28:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:11] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2004.codfw.wmnet
[13:30:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica2005.wikimedia.org
[13:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add read-only access for jayme [homer/public] - 10https://gerrit.wikimedia.org/r/741108 (owner: 10JMeybohm)
[13:31:41] <wikibugs>	 (03Merged) 10jenkins-bot: Add read-only access for jayme [homer/public] - 10https://gerrit.wikimedia.org/r/741108 (owner: 10JMeybohm)
[13:33:07] <wikibugs>	 (03PS4) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152
[13:33:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[13:34:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/740915 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah)
[13:34:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica2006.wikimedia.org
[13:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:53] <wikibugs>	 (03CR) 10Jcrespo: "I am not sure unit tests are running by default locally or remotely, but the patch works when tested specifically:" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[13:35:21] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2005.codfw.wmnet
[13:35:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:20] <XioNoX>	 !log add Jayme r/o user to all network devices
[13:36:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1141 (T296143)', diff saved to https://phabricator.wikimedia.org/P17812 and previous config saved to /var/cache/conftool/dbconfig/20211124-133628-ladsgroup.json
[13:36:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:33] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[13:37:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica2006.wikimedia.org
[13:37:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:37] <Amir1>	 I'm about to use a script to depool db1142 automatically, if it misbehaves, don't worry
[13:37:51] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2005.codfw.wmnet
[13:37:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T296143)', diff saved to https://phabricator.wikimedia.org/P17813 and previous config saved to /var/cache/conftool/dbconfig/20211124-133809-ladsgroup.json
[13:38:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[13:39:21] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2006.codfw.wmnet
[13:39:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1142.eqiad.wmnet with reason: Maintenance T296143
[13:39:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1142.eqiad.wmnet with reason: Maintenance T296143
[13:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:56] <wikibugs>	 (03PS5) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152
[13:41:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[13:41:14] <wikibugs>	 (03PS6) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152
[13:41:35] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2006.codfw.wmnet
[13:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:43] <wikibugs>	 (03PS7) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152
[13:43:12] <wikibugs>	 (03CR) 10Jbond: argparse: Fix number of parameters when String argument contains spaces (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[13:43:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff)
[13:44:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[13:49:26] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2023.codfw.wmnet
[13:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:39] <wikibugs>	 (03PS8) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152
[13:50:44] <wikibugs>	 (03PS2) 10Klausman: site: Move non-vm ML machines in codfw to setup for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/741154 (https://phabricator.wikimedia.org/T294412)
[13:51:18] <wikibugs>	 (03CR) 10Jcrespo: "Should we escape also existing double quotes? 'lol"lol' => "lol\"lol" ?" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[13:51:23] <wikibugs>	 (03CR) 10Jbond: "LGTM but will need nskaggs approval" [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere)
[13:52:31] <wikibugs>	 (03PS3) 10Klausman: site: Move non-vm ML machines in codfw to setup for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/741154 (https://phabricator.wikimedia.org/T294412)
[13:52:40] <wikibugs>	 (03CR) 10Kormat: "I can't speak for the release/yaml stuff, but the rest LGTM. 2 minor comments." [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[13:53:16] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10jbond) >>! In T296192#7525855, @Urbanecm wrote: > This has my support. Majavah is very helpful, and this level of access would definitel...
[13:54:22] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[13:54:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM serpens.wikimedia.org
[13:54:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:07] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2023.codfw.wmnet
[13:55:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:28] <wikibugs>	 (03PS1) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463)
[13:56:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[13:56:36] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[13:57:38] <wikibugs>	 (03PS2) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463)
[13:58:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM serpens.wikimedia.org
[13:58:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:11] <wikibugs>	 (03CR) 10Muehlenhoff: "Let's also add an approval: line for those two groups and set it to Nicholas, please." [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere)
[14:00:49] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2024.codfw.wmnet
[14:00:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:52] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[14:04:56] <wikibugs>	 (03PS9) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152
[14:05:12] <wikibugs>	 (03CR) 10Jcrespo: "Another take-- let me know what you think." [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[14:06:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[14:06:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM idp-test2001.wikimedia.org
[14:06:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:54] <wikibugs>	 (03Abandoned) 10Klausman: site: Move non-vm ML machines in codfw to setup for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/741154 (https://phabricator.wikimedia.org/T294412) (owner: 10Klausman)
[14:08:19] <wikibugs>	 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) The re-deploy of codfw was successful. Some take-aways are added here which came up in the codfw migration. The plan to migrate eqiad Kubernetes to `helm3`:  * Announce maintenanc...
[14:08:34] <wikibugs>	 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto)
[14:09:25] <wikibugs>	 (03CR) 10Jbond: argparse: Fix number of parameters when String argument contains spaces (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[14:09:30] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[14:10:27] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2024.codfw.wmnet
[14:10:29] <wikibugs>	 (03PS1) 10Jelto: hiera::role::common::deployment_server update helmBinary eqiad [puppet] - 10https://gerrit.wikimedia.org/r/741681 (https://phabricator.wikimedia.org/T251305)
[14:10:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp-test2001.wikimedia.org
[14:10:34] <icinga-wm>	 PROBLEM - Check systemd state on logstash2024 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:44] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Split the dbstore_multiinstance role into two others [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285)
[14:14:37] <wikibugs>	 (03CR) 10Jcrespo: mariadb: Split the dbstore_multiinstance role into two others (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[14:15:29] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2025.codfw.wmnet
[14:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:16] <wikibugs>	 (03CR) 10Jbond: "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[14:16:52] <wikibugs>	 (03CR) 10Jbond: argparse: Fix number of parameters when String argument contains spaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[14:19:17] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:19:18] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:19:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "+1 on the profile::contacts::role_contacts/Cumin alias changes and one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[14:20:53] <wikibugs>	 (03CR) 10Jcrespo: "Doing most of that- although it is getting confusing." [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[14:21:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[14:21:12] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2025.codfw.wmnet
[14:21:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM idp2001.wikimedia.org
[14:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:42] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mariadb: Split the dbstore_multiinstance role into two others (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[14:24:44] <wikibugs>	 (03CR) 10Jbond: argparse: Fix number of parameters when String argument contains spaces (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[14:25:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] mariadb: Split the dbstore_multiinstance role into two others (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[14:26:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[14:26:14] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2030.codfw.wmnet
[14:26:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp2001.wikimedia.org
[14:26:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:19] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:27:20] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:27:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:38] <hashar>	 _joe_: good news, the Zuul queue overflow alarm no longer shows up in this channel / sre :)
[14:28:16] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:28:17] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:28:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:53] <godog>	 !log systemctl reset-failed ifup@ens5.service on logstash2024 T273026
[14:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:56] <stashbot>	 T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[14:30:13] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh2001.wikimedia.org
[14:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:22] <icinga-wm>	 RECOVERY - Check systemd state on logstash2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:12] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:31:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:15] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:21] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Cloud-Services-Origin-Team, and 5 others: Refactor puppet:base module to reduce unneeded shared code paths - https://phabricator.wikimedia.org/T289661 (10dcaro)
[14:31:29] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Cloud-Services-Origin-Team, and 5 others: Refactor puppet:base module to reduce unneeded shared code paths - https://phabricator.wikimedia.org/T289661 (10dcaro)
[14:31:52] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2030.codfw.wmnet
[14:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:57] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:32:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:01] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:33:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:08] <wikibugs>	 (03PS3) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463)
[14:34:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh2001.wikimedia.org
[14:34:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:40] <wikibugs>	 (03PS3) 10Jbond: puppetmaster - hiera: order site after role [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez)
[14:35:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[14:36:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh2002.wikimedia.org
[14:36:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:32] <wikibugs>	 (03PS10) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152
[14:36:54] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash2031.codfw.wmnet
[14:36:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:23] <wikibugs>	 (03CR) 10Jcrespo: "I think this does what you suggested- please forgive if I missed something :-)" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[14:39:04] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash2031.codfw.wmnet
[14:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh2002.wikimedia.org
[14:40:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:24] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good. +1" [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[14:42:48] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Looking good: https://puppet-compiler.wmflabs.org/compiler1002/32610/" [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[14:44:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum2001.codfw.wmnet
[14:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10fgiunchedi)
[14:45:39] <wikibugs>	 (03PS2) 10MMandere: admin: Add user taavi to wmcs and labtest group [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192)
[14:46:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "+1, addresses moritzm's comment." [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere)
[14:47:20] <wikibugs>	 (03CR) 10MMandere: admin: Add user taavi to wmcs and labtest group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere)
[14:48:41] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "I wouldn't touch it for this patch scope, but more than open to change it on a followup patch, suggestions?" [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[14:49:23] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:49:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:29] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:49:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:37] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:49:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum2001.codfw.wmnet
[14:50:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM search-loader2001.codfw.wmnet
[14:50:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:00] <jynus>	 btullis, about to deploy gerrit:740815, I expect noop, but pinging thinking about the worst 
[14:52:39] <jynus>	 I will test it quickly on 2 hosts, revert if something unexpected happens
[14:52:43] <btullis>	 ack, thanks jynus.
[14:52:55] <jynus>	 and we can keep talking on the ticket
[14:53:04] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10nskaggs) Yes, this has my support. Thank you!
[14:53:12] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:22] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Split the dbstore_multiinstance role into two others [puppet] - 10https://gerrit.wikimedia.org/r/740815 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[14:54:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM search-loader2001.codfw.wmnet
[14:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:18] <wikibugs>	 (03PS11) 10Jbond: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[14:55:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum2002.codfw.wmnet
[14:55:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM, thanks <3" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[14:56:11] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] "+1 from me for Taavi. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere)
[14:56:21] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[14:57:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1142 (T296143)', diff saved to https://phabricator.wikimedia.org/P17815 and previous config saved to /var/cache/conftool/dbconfig/20211124-145721-ladsgroup.json
[14:57:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:25] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[14:57:46] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] admin: Add user taavi to wmcs and labtest group [puppet] - 10https://gerrit.wikimedia.org/r/741153 (https://phabricator.wikimedia.org/T296192) (owner: 10MMandere)
[14:57:49] <jynus>	 btullis, all good, only thing that changed was motd and "contacts.yaml" (no idea what that is used for, but all expected)
[14:58:05] <btullis>	 jynus: Great, many thanks.
[14:58:12] <wikibugs>	 (03PS12) 10Jbond: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[14:59:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[14:59:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum2002.codfw.wmnet
[14:59:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:42] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (10Papaul) @elukey thanks
[14:59:46] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:47] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:08] <wikibugs>	 (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/32609/" [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez)
[15:00:51] <wikibugs>	 10Puppet, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, 10Patch-For-Review: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (10jcrespo) Deployment went as expected- but now that I thought a bit, I think btull...
[15:01:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[15:01:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[15:02:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow2001.codfw.wmnet
[15:02:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:05] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "You were right- sorry, so many micro changes made things confusing. **Please go ahead and deploy at your convenience** if you want! This w" [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[15:03:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[15:03:54] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "wait, I saw a few deprecated comments. I think." [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[15:04:22] <wikibugs>	 (03PS1) 10Jbond: wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686
[15:04:57] <wikibugs>	 (03PS13) 10Jcrespo: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152
[15:05:34] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "That should be it." [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[15:05:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow2001.codfw.wmnet
[15:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:21] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[15:06:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:26] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:06:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[15:06:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:32] <wikibugs>	 (03PS2) 10Jbond: wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686
[15:07:18] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:07:36] <elukey>	 ta daaaan
[15:07:39] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10MMandere)
[15:07:55] <elukey>	 this is me and Tobias working on the codfw cluster, some issues with calico
[15:08:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] argparse: Fix number of parameters when String argument contains spaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[15:08:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve-ctrl2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:08:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM gitlab2001.wikimedia.org
[15:08:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[15:08:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[15:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[15:09:24] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (10Papaul) swapped DIMM B2 with DIMM A4
[15:09:42] <icinga-wm>	 RECOVERY - SSH on kubernetes1003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:12:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1142 (T296143)', diff saved to https://phabricator.wikimedia.org/P17817 and previous config saved to /var/cache/conftool/dbconfig/20211124-151226-ladsgroup.json
[15:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:30] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[15:12:34] <icinga-wm>	 RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.05 ms
[15:13:49] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) ml-serve-ctrl2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:14:04] <icinga-wm>	 PROBLEM - puppet last run on ms-be2058 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:14:31] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10Majavah) ` taavi@runko ~> ssh cloudcontrol1003.wikimedia.org Linux cloudcontrol1003 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 Debian GNU/Li...
[15:14:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM gitlab2001.wikimedia.org
[15:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[15:17:41] <wikibugs>	 (03PS14) 10Jbond: argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[15:17:59] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM dragonfly-supernode2001.codfw.wmnet
[15:18:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:05] <wikibugs>	 (03PS3) 10Jbond: wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686
[15:18:24] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots, labtest-roots for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T296192 (10MMandere) 05Open→03Resolved a:03MMandere Thank you too @Majavah  for confirming access.
[15:18:52] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb: move tests for search.wikimedia.org to apple-search [puppet] - 10https://gerrit.wikimedia.org/r/741119 (owner: 10Giuseppe Lavagetto)
[15:20:14] <icinga-wm>	 RECOVERY - puppet last run on ms-be2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:20:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[15:21:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686 (owner: 10Jbond)
[15:21:40] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dragonfly-supernode2001.codfw.wmnet
[15:21:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:34] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:22:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm)
[15:23:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] argparse: Fix number of parameters when String argument contains spaces [puppet] - 10https://gerrit.wikimedia.org/r/741152 (owner: 10Jcrespo)
[15:23:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[15:23:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) ml-serve-ctrl2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:23:41] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubestagemaster2001.codfw.wmnet
[15:23:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:52] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 104, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:23:57] <elukey>	 downtiming nodes
[15:24:21] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Remove :: from profile setup on 2 mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/741689 (https://phabricator.wikimedia.org/T296285)
[15:25:56] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: httpbb: add missing file in previous change [puppet] - 10https://gerrit.wikimedia.org/r/741690
[15:26:27] <wikibugs>	 (03CR) 10Jcrespo: "As promised ;-) https://puppet-compiler.wmflabs.org/compiler1002/32611/" [puppet] - 10https://gerrit.wikimedia.org/r/741689 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[15:26:30] <wikibugs>	 10SRE, 10ops-ulsfo: Update PDUs name-server config - https://phabricator.wikimedia.org/T295668 (10Papaul)
[15:26:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] httpbb: add missing file in previous change [puppet] - 10https://gerrit.wikimedia.org/r/741690 (owner: 10Giuseppe Lavagetto)
[15:26:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpbb: add missing file in previous change [puppet] - 10https://gerrit.wikimedia.org/r/741690 (owner: 10Giuseppe Lavagetto)
[15:27:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1142 (T296143)', diff saved to https://phabricator.wikimedia.org/P17819 and previous config saved to /var/cache/conftool/dbconfig/20211124-152731-ladsgroup.json
[15:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:36] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[15:27:40] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: httpbb: add missing file in previous change [puppet] - 10https://gerrit.wikimedia.org/r/741690
[15:27:53] <_joe_>	 sigh I'm on a roll today
[15:28:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (4) ml-serve-ctrl2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:29:05] <elukey>	 \o/
[15:30:01] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagemaster2001.codfw.wmnet
[15:30:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm)
[15:31:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb: add missing file in previous change [puppet] - 10https://gerrit.wikimedia.org/r/741690 (owner: 10Giuseppe Lavagetto)
[15:31:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM irc2001.wikimedia.org
[15:31:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:25] <wikibugs>	 (03PS4) 10Jbond: wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686
[15:32:40] <papaul>	 !log reboot ms-be2058 for firmware upgrade
[15:32:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:16] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Remove search.wikimedia.org from appservers [puppet] - 10https://gerrit.wikimedia.org/r/741079 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah)
[15:33:46] <icinga-wm>	 PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100%
[15:34:09] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM schema2003.codfw.wmnet
[15:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM irc2001.wikimedia.org
[15:35:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM kafkamon2002.codfw.wmnet
[15:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:18] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM schema2003.codfw.wmnet
[15:36:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:24] <icinga-wm>	 RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms
[15:38:07] <wikibugs>	 (03CR) 10Hashar: "Not sure why rubocop did not complain when I sent the original change. Anyway, thank you for the follow-up!" [puppet] - 10https://gerrit.wikimedia.org/r/741117 (owner: 10Jbond)
[15:39:10] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM schema2004.codfw.wmnet
[15:39:12] <wikibugs>	 (03PS1) 10Vgutierrez: cache::haproxy: Disable PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/741693 (https://phabricator.wikimedia.org/T290005)
[15:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafkamon2002.codfw.wmnet
[15:39:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[15:41:41] <wikibugs>	 (03CR) 10Kormat: partmon: add reuse partmon profile for cassandra hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan)
[15:42:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1142 (T296143)', diff saved to https://phabricator.wikimedia.org/P17820 and previous config saved to /var/cache/conftool/dbconfig/20211124-154236-ladsgroup.json
[15:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:40] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[15:43:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[15:44:13] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] mariadb: Remove :: from profile setup on 2 mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/741689 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[15:45:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1143.eqiad.wmnet with reason: Maintenance T296143
[15:45:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1143.eqiad.wmnet with reason: Maintenance T296143
[15:45:30] <wikibugs>	 (03CR) 10Jbond: rubocop: exclude lintian-junit-report (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741117 (owner: 10Jbond)
[15:45:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T296143)', diff saved to https://phabricator.wikimedia.org/P17821 and previous config saved to /var/cache/conftool/dbconfig/20211124-154533-ladsgroup.json
[15:45:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:55] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Remove :: from profile setup on 2 mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/741689 (https://phabricator.wikimedia.org/T296285) (owner: 10Jcrespo)
[15:48:20] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM schema2004.codfw.wmnet
[15:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:24] <icinga-wm>	 PROBLEM - Check systemd state on schema2004 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:49] <wikibugs>	 (03PS1) 10Ladsgroup: rdbms: Make TransactionProfiler logs more useful [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741132 (https://phabricator.wikimedia.org/T295706)
[15:49:36] <Amir1>	 jouncebot: nowandnext
[15:49:36] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 10 minute(s)
[15:49:36] <jouncebot>	 In 3 hour(s) and 10 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1900)
[15:49:36] <jouncebot>	 In 3 hour(s) and 10 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1900)
[15:49:41] <Amir1>	 nice
[15:49:48] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] rdbms: Make TransactionProfiler logs more useful [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741132 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup)
[15:50:17] <wikibugs>	 (03PS2) 10Vgutierrez: cache::haproxy: Disable PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/741693 (https://phabricator.wikimedia.org/T290005)
[15:51:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32613/console" [puppet] - 10https://gerrit.wikimedia.org/r/741079 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah)
[15:51:47] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] Remove search.wikimedia.org from appservers [puppet] - 10https://gerrit.wikimedia.org/r/741079 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah)
[15:52:36] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32614/console" [puppet] - 10https://gerrit.wikimedia.org/r/741693 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[15:55:08] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Papaul) I  received 10 out of 18 hosts. Can someone please update the racking information?  Thanks
[15:55:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib::argparse: improve type checking [puppet] - 10https://gerrit.wikimedia.org/r/741686 (owner: 10Jbond)
[15:59:20] <icinga-wm>	 RECOVERY - Check systemd state on schema2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:59:49] <wikibugs>	 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey)
[16:00:09] <btullis>	 !log systemctl reset-failed ifup@ens5.service on schema2004 T273026
[16:00:10] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[16:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:14] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[16:00:15] <stashbot>	 T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[16:00:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10aborrero) Moved debate into {T296411}
[16:02:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10MoritzMuehlenhoff)
[16:07:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[16:07:51] <wikibugs>	 (03PS1) 10Ssingh: test_dns: add a DoT check against all doh* hosts [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/741698
[16:08:59] <wikibugs>	 (03Merged) 10jenkins-bot: rdbms: Make TransactionProfiler logs more useful [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741132 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup)
[16:09:15] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[16:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:18] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[16:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:20] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] test_dns: add a DoT check against all doh* hosts [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/741698 (owner: 10Ssingh)
[16:13:30] <Amir1>	 !log start of "foreachwikiindblist s3 migrateRevisionActorTemp.php --sleep=2" in mwmaint1002 in a screen. It will take a month or so (T275246)
[16:13:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[16:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:34] <stashbot>	 T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246
[16:13:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[16:15:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:58] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster2002.codfw.wmnet
[16:17:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[16:19:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[16:19:09] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster2002.codfw.wmnet
[16:19:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:21] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster2001.codfw.wmnet
[16:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm)
[16:23:02] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubestagetcd2001.codfw.wmnet
[16:23:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: add handling of php-fpm logs via rsyslogd [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto)
[16:23:35] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster2001.codfw.wmnet
[16:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:27] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@6253399]: Regular analytics weekly train [analytics/refinery@6253399]
[16:25:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:32] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagetcd2001.codfw.wmnet
[16:25:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:21] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: add handling of php-fpm logs via rsyslogd [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto)
[16:29:59] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubestagetcd2003.codfw.wmnet
[16:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm)
[16:30:36] <wikibugs>	 (03CR) 10Kormat: partmon: add reuse partmon profile for cassandra hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan)
[16:31:28] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubetcd2004.codfw.wmnet
[16:31:29] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagetcd2003.codfw.wmnet
[16:31:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:04] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: service::catalog: DRY the wikireplicas section [puppet] - 10https://gerrit.wikimedia.org/r/741703
[16:33:00] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[16:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:02] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[16:33:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:39] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubetcd2004.codfw.wmnet
[16:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:57] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32615/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto)
[16:33:58] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubestagetcd2002.codfw.wmnet
[16:33:59] <Amir1>	 testing done, moving forward
[16:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:51] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32616/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto)
[16:35:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm)
[16:35:07] <icinga-wm>	 PROBLEM - Host ms-be2058 is DOWN: PING CRITICAL - Packet loss = 100%
[16:35:08] <wikibugs>	 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10elukey) Thanks a lot @jbond for all the info, I have other questions/doubts in mind, I think that we are close to finding a solution but I feel that some things need to be discussed first.  1) p12/jks bundles  The `...
[16:35:56] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/includes/libs/rdbms/: Backport: [[gerrit:741132|rdbms: Make TransactionProfiler logs more useful (T295706)]] (duration: 00m 57s)
[16:35:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:00] <stashbot>	 T295706: Improve TransactionProfiler as replacement for tendril's slow queries - https://phabricator.wikimedia.org/T295706
[16:36:02] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubetcd2006.codfw.wmnet
[16:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:18] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: service::catalog: DRY the wikireplicas section [puppet] - 10https://gerrit.wikimedia.org/r/741703
[16:36:41] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagetcd2002.codfw.wmnet
[16:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32617/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto)
[16:37:51] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] upgrade ecs to 1.11.0 [software/ecs] - 10https://gerrit.wikimedia.org/r/735417 (https://phabricator.wikimedia.org/T294581) (owner: 10Cwhite)
[16:37:59] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[16:38:13] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubetcd2006.codfw.wmnet
[16:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:29] <wikibugs>	 (03Merged) 10jenkins-bot: upgrade ecs to 1.11.0 [software/ecs] - 10https://gerrit.wikimedia.org/r/735417 (https://phabricator.wikimedia.org/T294581) (owner: 10Cwhite)
[16:40:57] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[16:40:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:00] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[16:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:36] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubetcd2005.codfw.wmnet
[16:41:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:55] <icinga-wm>	 RECOVERY - Host ms-be2058 is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms
[16:42:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm)
[16:42:43] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[16:42:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:53] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[16:42:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:57] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: service::catalog: DRY the wikireplicas section [puppet] - 10https://gerrit.wikimedia.org/r/741703
[16:43:03] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300 (10Papaul) Firmware upgrade complete on the server. leaving the server up to see if the error shows on DIMM A4
[16:43:17] <wikibugs>	 (03PS1) 10JHathaway: admin: Add myself (jhathaway) [puppet] - 10https://gerrit.wikimedia.org/r/741705
[16:43:19] <wikibugs>	 (03PS1) 10JHathaway: admin: Add myself(jhathaway) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/741706
[16:43:26] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[16:43:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:48] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubetcd2005.codfw.wmnet
[16:43:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:00] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[16:44:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[16:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:27] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32618/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto)
[16:45:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ host wikifunctions.beta.wmflabs.org" [puppet] - 10https://gerrit.wikimedia.org/r/714068 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester)
[16:46:49] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[16:47:04] <wikibugs>	 (03PS1) 10Cwhite: logstash: deploy ecs 1.11.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/741707 (https://phabricator.wikimedia.org/T294581)
[16:47:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:47:20] <wikibugs>	 (03PS3) 10Razzi: superset: set webserver timeout to 180 seconds [puppet] - 10https://gerrit.wikimedia.org/r/740712 (https://phabricator.wikimedia.org/T294771)
[16:47:38] <wikibugs>	 (03CR) 10Jobo: [V: 03+2] admin: Add myself(jhathaway) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/741706 (owner: 10JHathaway)
[16:48:00] <wikibugs>	 (03CR) 10Dzahn: "oh, thank you for that. one time I was able to deploy just fine, the other times I wasn't and it timed out. as mentioned before it's not t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/741124 (owner: 10Jelto)
[16:48:34] <wikibugs>	 (03CR) 10Jobo: [V: 03+2] admin: Add myself (jhathaway) [puppet] - 10https://gerrit.wikimedia.org/r/741705 (owner: 10JHathaway)
[16:49:38] <wikibugs>	 (03PS2) 10Cwhite: logstash: deploy ecs 1.11.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/741707 (https://phabricator.wikimedia.org/T294581)
[16:49:59] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[16:50:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:14] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[16:50:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:38] <wikibugs>	 (03PS1) 10Majavah: hieradata: fix beta wikifunction setup [puppet] - 10https://gerrit.wikimedia.org/r/741708
[16:51:17] <majavah>	 mutante: James_F: https://gerrit.wikimedia.org/r/c/operations/puppet/+/741708/
[16:51:27] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: service::catalog: DRY the wikireplicas section [puppet] - 10https://gerrit.wikimedia.org/r/741703
[16:52:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[16:52:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32619/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto)
[16:53:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "This change is now a NOOP on lvs1016, so I think it should be good to go." [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto)
[16:55:21] <wikibugs>	 (03CR) 10Dzahn: gitlab: restore script keep_config options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth)
[16:55:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: Add myself (jhathaway) [puppet] - 10https://gerrit.wikimedia.org/r/741705 (owner: 10JHathaway)
[16:56:24] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: puppetmaster - hiera: order site after role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez)
[16:56:45] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2005.codfw.wmnet
[16:56:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: Add myself(jhathaway) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/741706 (owner: 10JHathaway)
[16:56:57] <James_F>	 majavah: WF won't be a multilingual site?
[16:57:05] <majavah>	 it won't?
[16:57:10] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] hieradata: fix beta wikifunction setup [puppet] - 10https://gerrit.wikimedia.org/r/741708 (owner: 10Majavah)
[16:57:21] <mutante>	 wait.. :)
[16:57:28] <James_F>	 Oh, you mean a single site with multiple languages, unlike WP which is multiple sites each with one language?
[16:57:30] <majavah>	 I thought that it will be like wikidata etc
[16:57:42] <James_F>	 Yeah, we're like Wikidata.
[16:57:53] <mutante>	 ACK. i will keep merging
[16:58:05] <James_F>	 But api.wikifunctions.org (and api.wikifunctions.beta.wmflabs.org) will be a non-MediaWiki install; is that OK?
[16:58:15] <mutante>	 or not, because 2 pending merges on master
[16:58:18] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@6253399]: Regular analytics weekly train [analytics/refinery@6253399] (duration: 32m 50s)
[16:58:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:35] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@6253399] (thin): Regular analytics weekly train THIN [analytics/refinery@6253399]
[16:58:35] <mutante>	 and they are access related.. so.. i'll give it a few
[16:58:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:42] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@6253399] (thin): Regular analytics weekly train THIN [analytics/refinery@6253399] (duration: 00m 07s)
[16:58:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:54] <majavah>	 James_F: yeah, it's fine, it needs to be set up separately anyways
[16:58:55] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@6253399] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6253399]
[16:58:56] <James_F>	 mutante, majavah: Thank you both.
[16:58:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:58] <mutante>	 majavah: thanks! not merged on master just yet
[16:59:41] * James_F nods.
[17:00:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:00:20] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Krinkle) @Dzahn Just an idea, but if we create an alias of some...
[17:00:27] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2005.codfw.wmnet
[17:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:51] <wikibugs>	 10SRE, 10Observability-Logging: Develop tooling for quickly parsing 5xx and sampled-1000 logs - https://phabricator.wikimedia.org/T292682 (10lmata)
[17:01:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1143 (T296143)', diff saved to https://phabricator.wikimedia.org/P17826 and previous config saved to /var/cache/conftool/dbconfig/20211124-170100-ladsgroup.json
[17:01:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:04] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[17:01:17] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM registry2003.codfw.wmnet
[17:01:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:48] <jbond>	 mutante: happy for me to merge your change
[17:01:59] <mutante>	 jbond: ok, please do. cloud/beta only :)
[17:02:16] <jbond>	 cool thanks
[17:02:18] <mutante>	 majavah: James_F: now
[17:02:27] <mutante>	 thanks as well
[17:02:32] <wikibugs>	 (03PS1) 10Ladsgroup: rdbms: Add full query to transaction profiler [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741134 (https://phabricator.wikimedia.org/T295706)
[17:03:04] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] rdbms: Add full query to transaction profiler [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741134 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup)
[17:05:28] <wikibugs>	 (03CR) 10Krinkle: alertmanager: Update address for perf-team alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: 10Krinkle)
[17:05:39] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry2003.codfw.wmnet
[17:05:40] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@6253399] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6253399] (duration: 06m 45s)
[17:05:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:14] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2006.codfw.wmnet
[17:06:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:54] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM registry2004.codfw.wmnet
[17:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm)
[17:07:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10ayounsi)
[17:08:31] <wikibugs>	 10SRE, 10Observability-Alerting: Icinga meta monitoring pages during icinga host reboots - https://phabricator.wikimedia.org/T274662 (10lmata)
[17:08:45] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw
[17:08:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:51] <wikibugs>	 10SRE, 10Citoid, 10Observability-Logging, 10Wikimedia-Logstash, and 3 others: Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (10lmata)
[17:09:55] <wikibugs>	 (03Abandoned) 10Dzahn: rename base/files/labs to base/files/cloud [puppet] - 10https://gerrit.wikimedia.org/r/740903 (owner: 10Dzahn)
[17:10:01] <wikibugs>	 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10lmata)
[17:10:31] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "it first needs https://phabricator.wikimedia.org/T296331#7525107" [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall)
[17:11:13] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry2004.codfw.wmnet
[17:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:25] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2006.codfw.wmnet
[17:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:38] <majavah>	 James_F: the domain is now configured on the apache side (and I purged the previous different error messages from the caches after getting confused) and requests are now making it to mwmultiversion
[17:16:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1143 (T296143)', diff saved to https://phabricator.wikimedia.org/P17827 and previous config saved to /var/cache/conftool/dbconfig/20211124-171604-ladsgroup.json
[17:16:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:10] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[17:17:00] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM chartmuseum2001.codfw.wmnet
[17:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:17] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) @Krinkle sure, always a good idea to replace hardcoded ho...
[17:17:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[17:17:36] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10akosiaris) Thanks @papaul. We 'll get back to you!
[17:17:59] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2015.codfw.wmnet
[17:18:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:09] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2015.codfw.wmnet
[17:20:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:22] <wikibugs>	 (03CR) 10Razzi: [C: 03+1] "LGTM, would you like to pair on deploying this, Andrew?" [puppet] - 10https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: 10Samuel (WMF))
[17:20:41] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM chartmuseum2001.codfw.wmnet
[17:20:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:40] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubernetes2016.codfw.wmnet
[17:21:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:21:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:21:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:48] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] cache/text_haproxy: remove scholarships.wikimedia.org config [puppet] - 10https://gerrit.wikimedia.org/r/740907 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn)
[17:22:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10JMeybohm)
[17:23:00] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw
[17:23:00] <wikibugs>	 (03CR) 10Krinkle: "This is uncontroversial to merge as far as I'm concerned. I've checked the two hosts via ssh, they're up, have the same role as doc1001, a" [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[17:23:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[17:23:49] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubernetes2016.codfw.wmnet
[17:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:32] <wikibugs>	 (03Merged) 10jenkins-bot: rdbms: Add full query to transaction profiler [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741134 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup)
[17:25:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Delete roles for bare metal WMCS puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/740913 (owner: 10Majavah)
[17:25:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:25:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:36] <wikibugs>	 (03PS1) 10Jbond: no op change to demo puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/741710
[17:25:38] <wikibugs>	 (03PS1) 10Jbond: no op change to demo puppet-mere [puppet] - 10https://gerrit.wikimedia.org/r/741711
[17:26:03] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:26:04] <mutante>	 end of year = more people start to delete stuff :)
[17:26:17] <wikibugs>	 (03Restored) 10Hashar: scap/dsh: add doc1002/doc2001 to ci-docroot hosts [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[17:26:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] puppet_alert: Condider zero resources a failure [puppet] - 10https://gerrit.wikimedia.org/r/740897 (owner: 10Majavah)
[17:26:39] <mutante>	 bbiaw, afk
[17:27:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] no op change to demo puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/741710 (owner: 10Jbond)
[17:27:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] no op change to demo puppet-mere [puppet] - 10https://gerrit.wikimedia.org/r/741711 (owner: 10Jbond)
[17:27:09] <wikibugs>	 (03PS1) 10Majavah: P::doc: sync data to non-active servers [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653)
[17:27:21] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[17:27:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[17:27:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:45] <wikibugs>	 (03PS5) 10Hashar: scap/dsh: add doc1002/doc2001 to ci-docroot hosts [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[17:28:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[17:28:40] <wikibugs>	 (03CR) 10Hashar: "Requested by Timo, we can indeed have integration/docroot deployed to all hosts even if there is little bandwidth now to do the switch." [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[17:28:48] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] scap/dsh: add doc1002/doc2001 to ci-docroot hosts [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[17:29:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[17:29:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1143 (T296143)', diff saved to https://phabricator.wikimedia.org/P17828 and previous config saved to /var/cache/conftool/dbconfig/20211124-173110-ladsgroup.json
[17:31:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:14] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[17:31:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:31:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:06] <wikibugs>	 (03PS2) 10Majavah: P::doc: sync data to non-active servers [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653)
[17:33:17] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[17:34:28] <logmsgbot>	 !log jhathaway@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=puppetboard
[17:34:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:34:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:03] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/includes/libs/rdbms/: Backport: [[gerrit:741134|rdbms: Add full query to transaction profiler (T295706)]] (duration: 00m 56s)
[17:35:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:06] <stashbot>	 T295706: Improve TransactionProfiler as replacement for tendril's slow queries - https://phabricator.wikimedia.org/T295706
[17:39:05] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:40:35] <wikibugs>	 10ops-eqiad, 10DC-Ops: Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (10RobH)
[17:40:48] <wikibugs>	 10ops-eqiad, 10DC-Ops: Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (10RobH)
[17:41:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:41:33] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Seems like a nice refactor. LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto)
[17:41:45] <wikibugs>	 10ops-eqiad, 10DC-Ops: Row E/F temp/humid probe installation - https://phabricator.wikimedia.org/T296424 (10RobH)
[17:44:20] <wikibugs>	 (03PS3) 10Majavah: P::doc: sync data to non-active servers [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653)
[17:44:37] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[17:46:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1143 (T296143)', diff saved to https://phabricator.wikimedia.org/P17829 and previous config saved to /var/cache/conftool/dbconfig/20211124-174615-ladsgroup.json
[17:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:19] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[17:47:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1144.eqiad.wmnet with reason: Maintenance T296143
[17:47:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1144.eqiad.wmnet with reason: Maintenance T296143
[17:47:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17830 and previous config saved to /var/cache/conftool/dbconfig/20211124-174723-ladsgroup.json
[17:47:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:48:37] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "See https://yaml.org/type/merge.html and https://ktomk.github.io/writing/yaml-anchor-alias-and-merge-key.html if you are unfamiliar with h" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto)
[17:53:10] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/740712 (https://phabricator.wikimedia.org/T294771) (owner: 10Razzi)
[17:54:07] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:54:11] <wikibugs>	 (03Abandoned) 10Majavah: hieradata: Route search.wm.o to apple-search [puppet] - 10https://gerrit.wikimedia.org/r/740642 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah)
[17:59:03] <wikibugs>	 (03PS1) 10Majavah: P::doc: use correct php_fpm path [puppet] - 10https://gerrit.wikimedia.org/r/741715
[17:59:40] <wikibugs>	 (03PS2) 10Majavah: P::doc: use correct php_fpm path [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653)
[18:00:29] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:01:56] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] cirrussearch: s/sanitizer/saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/740711 (https://phabricator.wikimedia.org/T295705) (owner: 10Ryan Kemper)
[18:02:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[18:04:51] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:09:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:11:19] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:12:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[18:13:13] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Majavah) >>! In T247653#7527389, @Dzahn wrote: >> should the new...
[18:14:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[18:14:07] <icinga-wm>	 RECOVERY - dump of s1 in eqiad on alert1001 is OK: Last dump for s1 at eqiad (db1140.eqiad.wmnet:3311) taken on 2021-11-24 09:48:02 (162 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[18:20:01] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:24:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:30:14] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM acmechief2001.codfw.wmnet
[18:30:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ops-monitoring-bot) VM acmechief2001.codfw.wmnet rebooted by vgutierrez@cumin1001 with reason: None
[18:30:53] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:34:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[18:35:19] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:36:30] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM acmechief2001.codfw.wmnet
[18:36:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:55] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM acmechief-test2001.codfw.wmnet
[18:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ops-monitoring-bot) VM acmechief-test2001.codfw.wmnet rebooted by vgutierrez@cumin1001 with reason: None
[18:41:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:42:12] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM acmechief-test2001.codfw.wmnet
[18:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:30] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ncredir2001.codfw.wmnet
[18:42:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:33] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM ncredir2001.codfw.wmnet
[18:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ops-monitoring-bot) VM ncredir2001.codfw.wmnet rebooted by vgutierrez@cumin1001 with reason: None
[18:43:14] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ncredir2001.codfw.wmnet
[18:43:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ops-monitoring-bot) VM ncredir2001.codfw.wmnet rebooted by vgutierrez@cumin1001 with reason: None
[18:43:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:44:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[18:48:37] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir2001.codfw.wmnet
[18:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:02] <wikibugs>	 (03PS1) 10Kosta Harlan: TaskSet: Add ImageRecommendationFilter [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741135 (https://phabricator.wikimedia.org/T295410)
[18:51:43] <wikibugs>	 (03Abandoned) 10Kosta Harlan: TaskSet: Add ImageRecommendationFilter [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/741135 (https://phabricator.wikimedia.org/T295410) (owner: 10Kosta Harlan)
[18:51:54] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ncredir2002.codfw.wmnet
[18:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10ops-monitoring-bot) VM ncredir2002.codfw.wmnet rebooted by vgutierrez@cumin1001 with reason: None
[18:52:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:54:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:57:16] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir2002.codfw.wmnet
[18:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294119 (10Vgutierrez)
[18:59:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Change LDAP username? - https://phabricator.wikimedia.org/T296429 (10Majavah) 05Open→03Declined Declining as https://wikitech.wikimedia.org/wiki/SRE/LDAP/Renaming_users says that renaming existing users is not possible. You might need to create a new user if you want a userna...
[18:59:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[19:00:05] <jouncebot>	 Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1900)
[19:00:05] <jouncebot>	 RoanKattouw and Urbanecm: Your horoscope predicts another unfortunate UTC evening backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T1900).
[19:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[19:01:25] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:03:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17831 and previous config saved to /var/cache/conftool/dbconfig/20211124-190343-ladsgroup.json
[19:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:47] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[19:04:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] maintain-views.yaml: Restrict `localuser` table to prevent disclosure [puppet] - 10https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: 10Samuel (WMF))
[19:04:53] <wikibugs>	 (03PS3) 10Andrew Bogott: maintain-views.yaml: Restrict `localuser` table to prevent disclosure [puppet] - 10https://gerrit.wikimedia.org/r/732740 (https://phabricator.wikimedia.org/T292594) (owner: 10Samuel (WMF))
[19:05:49] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:06:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[19:10:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:14:21] <wikibugs>	 10SRE, 10Platform Engineering: Technical advice on migrating content from Outreach-wiki to Meta-wiki - https://phabricator.wikimedia.org/T296091 (10Ladsgroup) Redirects in foundationwiki work like that: https://foundation.wikimedia.org/w/index.php?title=Legal_talk:New_User_Welcome_Survey_Privacy_Statement/fa&r...
[19:14:33] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:16:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[19:16:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Change LDAP username? - https://phabricator.wikimedia.org/T296429 (10freephile) Thanks anyway @Majavah , and for pointing to the docs. I wasn't sure if it was possible and didn't find the wikitech info previously.
[19:18:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17832 and previous config saved to /var/cache/conftool/dbconfig/20211124-191847-ladsgroup.json
[19:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:52] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[19:19:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[19:19:58] <razzi>	 !log run `maintain-views --all-databases --replace-all` on clouddb1013 for T292594
[19:20:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:21:39] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1279.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:21:57] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1307.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:24:14] <jynus>	 mm
[19:25:13] <jynus>	 I think there is a missing alert disabling because of backups, fixing
[19:25:29] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:25:50] <jynus>	 I will check if it happened on more instances
[19:29:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:32:03] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:33:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17833 and previous config saved to /var/cache/conftool/dbconfig/20211124-193352-ladsgroup.json
[19:33:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:57] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[19:38:12] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) >>! In T247653#7527732, @Majavah wrote: >  Stretch is sta...
[19:38:35] <razzi>	 !log `sudo maintain-views --all-databases --replace-all` on clouddb1018 for T292594
[19:38:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[19:40:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:42:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[19:43:01] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:48:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1144:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17834 and previous config saved to /var/cache/conftool/dbconfig/20211124-194857-ladsgroup.json
[19:49:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:02] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[19:52:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[19:56:08] <wikibugs>	 (03PS4) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463)
[20:00:29] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:02:39] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:10:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[20:14:09] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s5 on db1150 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db1130.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:16:07] <RhinosF1>	 ^ only backups
[20:18:55] <jynus>	 that is not "only backups" that is a network error that shouldn't happen
[20:19:22] <RhinosF1>	 I meant the host was only backups
[20:19:38] <RhinosF1>	 I.e. no user-facing impact
[20:20:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:20:43] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s5 on db1150 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:22:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:22:31] <wikibugs>	 (03PS1) 10Papaul: Add elastic206[1-9]elastic207[0-2] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/741720 (https://phabricator.wikimedia.org/T294154)
[20:23:16] <jynus>	 there is something bad going on on that host, I will file a task
[20:24:19] <RhinosF1>	 It's not the only eqiad db to have net issues
[20:24:22] <jynus>	 I think it is just being network saturated
[20:25:16] <jynus>	 I will downtime it and check it tomorrow
[20:25:23] <RhinosF1>	 db1131 had a broken cable a few days ago (https://phabricator.wikimedia.org/T295952)
[20:25:25] <jynus>	 there is lots of regular packet loss
[20:25:29] <RhinosF1>	 Ah
[20:26:35] <jynus>	 it could be hw issues, or just resource starvation
[20:30:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[20:31:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:35:33] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:40:37] <wikibugs>	 (03PS2) 10Legoktm: mediawiki: Remove tidy binary [puppet] - 10https://gerrit.wikimedia.org/r/732386
[20:40:39] <wikibugs>	 (03PS2) 10Legoktm: mediawiki: Remove libvips-tools [puppet] - 10https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802)
[20:41:35] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add elastic206[1-9]elastic207[0-2] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/741720 (https://phabricator.wikimedia.org/T294154) (owner: 10Papaul)
[20:42:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:43:01] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] mediawiki: Remove tidy binary [puppet] - 10https://gerrit.wikimedia.org/r/732386 (owner: 10Legoktm)
[20:43:07] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] mediawiki: Remove libvips-tools [puppet] - 10https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802) (owner: 10Legoktm)
[20:43:54] <wikibugs>	 (03PS3) 10Legoktm: mediawiki: Remove libvips-tools [puppet] - 10https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802)
[20:44:00] <wikibugs>	 (03CR) 10Legoktm: [V: 03+2 C: 03+2] mediawiki: Remove libvips-tools [puppet] - 10https://gerrit.wikimedia.org/r/732387 (https://phabricator.wikimedia.org/T290802) (owner: 10Legoktm)
[20:44:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:49:25] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, and 2 others: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10Papaul)
[20:50:14] <wikibugs>	 (03PS7) 10Legoktm: Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194
[20:50:50] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (10Legoktm) 05Open→03Resolved a:03Legoktm
[20:50:55] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:51:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2061.codfw.wmnet with OS buster
[20:51:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:56] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, and 2 others: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2061.codfw.wmnet with OS buster
[20:51:57] <wikibugs>	 (03PS2) 10Legoktm: Set $wgMaxImageArea = false; [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725101 (https://phabricator.wikimedia.org/T291014)
[20:52:07] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 (owner: 10Legoktm)
[20:52:51] <wikibugs>	 (03Merged) 10jenkins-bot: Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 (owner: 10Legoktm)
[20:53:02] <wikibugs>	 (03Abandoned) 10Legoktm: service: Enable paging for shellbox-constraints service [puppet] - 10https://gerrit.wikimedia.org/r/711737 (owner: 10Legoktm)
[20:53:07] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:54:30] <logmsgbot>	 !log legoktm@deploy1002 Synchronized wmf-config/: Update configuration related to disabling Score functionality (duration: 00m 57s)
[20:54:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:58] <wikibugs>	 (03CR) 10Legoktm: [V: 03+2 C: 03+2] Update README for purpose of this repository, remove unused fonts [mediawiki-config/fonts] - 10https://gerrit.wikimedia.org/r/732792 (owner: 10Legoktm)
[20:58:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[20:58:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:58:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:52] <wikibugs>	 (03CR) 10Legoktm: "I have fixes for about half of Effie's review sitting locally, I'm wondering if it would be easier to first have a bash or Python script t" [cookbooks] - 10https://gerrit.wikimedia.org/r/727605 (owner: 10Legoktm)
[21:00:04] <jouncebot>	 chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211124T2100).
[21:00:33] <wikibugs>	 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Majavah)
[21:00:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:18] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Majavah) 05Stalled→03Open I don't think this is stalled on a...
[21:01:57] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:04:09] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:05:29] <wikibugs>	 (03Abandoned) 10Legoktm: Have PagedTiffHandler use Shellbox on Commons for 10% of requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724577 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm)
[21:07:24] <wikibugs>	 (03PS1) 10Legoktm: Enable paging on all Shellboxes [puppet] - 10https://gerrit.wikimedia.org/r/741724
[21:10:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:13:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:14:35] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Enable paging on all Shellboxes [puppet] - 10https://gerrit.wikimedia.org/r/741724 (owner: 10Legoktm)
[21:18:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:21:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2061.codfw.wmnet with OS buster
[21:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:25] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2061.codfw.wmnet with OS buster comp...
[21:22:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:27:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:31:17] <wikibugs>	 (03PS3) 10Legoktm: Improve docs on $wmgUseGlobalAbuseFilters and sort list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709560
[21:31:49] <wikibugs>	 (03Abandoned) 10Legoktm: analytics: Migrate clean_jupyter_user_local_trash to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/708183 (https://phabricator.wikimedia.org/T286442) (owner: 10Legoktm)
[21:31:53] <wikibugs>	 (03Abandoned) 10Legoktm: analytics: Remove absented clean_jupyter_user_local_trash cron [puppet] - 10https://gerrit.wikimedia.org/r/708184 (https://phabricator.wikimedia.org/T273673) (owner: 10Legoktm)
[21:33:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[21:33:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2062.codfw.wmnet with OS buster
[21:33:35] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Improve docs on $wmgUseGlobalAbuseFilters and sort list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709560 (owner: 10Legoktm)
[21:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:43] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2062.codfw.wmnet with OS buster
[21:33:53] <wikibugs>	 (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add tox.ini for CI [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/708031 (owner: 10Legoktm)
[21:34:20] <wikibugs>	 (03Merged) 10jenkins-bot: Improve docs on $wmgUseGlobalAbuseFilters and sort list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709560 (owner: 10Legoktm)
[21:35:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:35:46] <logmsgbot>	 !log legoktm@deploy1002 Synchronized wmf-config/: Improve docs on $wmgUseGlobalAbuseFilters and sort list of wikis (duration: 00m 57s)
[21:35:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:37:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:55] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:40:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:44:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[21:48:02] <wikibugs>	 (03PS1) 10Ebernhardson: Revert "Add repository-swift plugin" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/741734 (https://phabricator.wikimedia.org/T295705)
[21:53:31] <wikibugs>	 (03Abandoned) 10Legoktm: exim: Drop support for legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm)
[21:53:40] <wikibugs>	 (03Abandoned) 10Legoktm: exim: Clean up remnants of legacy_mailing_lists [puppet] - 10https://gerrit.wikimedia.org/r/681724 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm)
[21:53:49] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Figure out if we can remove legacy domain support for mailing lists - https://phabricator.wikimedia.org/T280472 (10Legoktm) Comments from Gerrit:  @herron said: > I'm in favor of removing this, but still see a fair amount of legacy list mail in the exim l...
[21:54:10] <wikibugs>	 (03PS2) 10Ebernhardson: Revert "Add repository-swift plugin" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/741734 (https://phabricator.wikimedia.org/T295705)
[21:54:24] <wikibugs>	 (03Abandoned) 10Legoktm: systemd: Ensure units are unmasked [puppet] - 10https://gerrit.wikimedia.org/r/701171 (https://phabricator.wikimedia.org/T285425) (owner: 10Legoktm)
[21:54:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "There might be some more details but I think it's ok if you just merge and go ahead and test again. Incremental changes are good and it's " [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth)
[21:57:19] <wikibugs>	 (03Abandoned) 10Legoktm: httpd: Add directory for applications to add config [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/691287 (owner: 10Legoktm)
[21:58:53] <wikibugs>	 (03Abandoned) 10Legoktm: docker: Stop copying config for each Debian version [puppet] - 10https://gerrit.wikimedia.org/r/683979 (owner: 10Legoktm)
[22:03:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2062.codfw.wmnet with OS buster
[22:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:38] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2062.codfw.wmnet with OS buster comp...
[22:06:09] <wikibugs>	 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn)
[22:06:58] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) 05Open→03Stalled It's stalled on bandwidth of releng a...
[22:08:20] <wikibugs>	 (03PS2) 10Legoktm: scap: Port mwgrep to Python 3 and other cleanup [puppet] - 10https://gerrit.wikimedia.org/r/565800
[22:08:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2063.codfw.wmnet with OS buster
[22:08:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:36] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2063.codfw.wmnet with OS buster
[22:09:31] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[22:10:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/32620/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[22:10:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:12:06] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] scap: Port mwgrep to Python 3 and other cleanup [puppet] - 10https://gerrit.wikimedia.org/r/565800 (owner: 10Legoktm)
[22:13:26] <wikibugs>	 (03CR) 10Dzahn: "hosts have been added to /etc/dsh/group/ci-docroot on deploy1002 (by puppet)" [puppet] - 10https://gerrit.wikimedia.org/r/650306 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[22:13:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:17:52] <wikibugs>	 (03PS1) 10Legoktm: Revert "scap: Port mwgrep to Python 3 and other cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/741138
[22:18:03] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Revert "scap: Port mwgrep to Python 3 and other cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/741138 (owner: 10Legoktm)
[22:18:21] <Reedy>	 lol
[22:19:52] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/32621/" [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[22:21:41] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] P::doc: use correct php_fpm path [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[22:23:30] <wikibugs>	 (03CR) 10Dzahn: "[doc1002:~] $ file /run/php/php7.3-fpm.sock" [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[22:25:03] <wikibugs>	 (03CR) 10Dzahn: "deployed first on new machines, now on prod machine" [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[22:26:26] <wikibugs>	 (03CR) 10Dzahn: "restarted apache on doc1001 - no issues" [puppet] - 10https://gerrit.wikimedia.org/r/741715 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[22:29:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:31:04] <wikibugs>	 (03CR) 10Dzahn: "since this is using rsync::server::module directly and not rsync::quickdatacopy I think the firewall holes via ferm are not included and w" [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[22:33:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:36:21] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts gitlab-runner1001.wikimedia.org
[22:36:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:32] <mutante>	 !log running decom cookbook on gitlab-runner1001.wikimedia.org VM which was in state "ADMIN_down" and not used yet. to make room to recreate it as gitlab-runner1001.eqiad.wmnet T295481
[22:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:37] <stashbot>	 T295481: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481
[22:39:08] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2063.codfw.wmnet with OS buster
[22:39:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:13] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2063.codfw.wmnet with OS buster comp...
[22:39:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:40:22] <mutante>	 Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
[22:41:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2064.codfw.wmnet with OS buster
[22:41:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:59] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2064.codfw.wmnet with OS buster
[22:43:38] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab-runner1001.wikimedia.org
[22:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:51] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/32623/" [puppet] - 10https://gerrit.wikimedia.org/r/741707 (https://phabricator.wikimedia.org/T294581) (owner: 10Cwhite)
[22:44:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:50:04] <mutante>	 !log Creating a new Ganeti VM and wondering which row to put it in? [ganeti1009:~] $ for row in A B C D; do echo "row ${row}: $(sudo gnt-instance list -o name -F "pnode.group == 'row_${row}'" | wc -l) VMs"; done
[22:50:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:39] <wikibugs>	 (03PS3) 10Cwhite: profile: logstash: add production logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/732438 (https://phabricator.wikimedia.org/T288618)
[22:52:36] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/739351 (https://phabricator.wikimedia.org/T295805) (owner: 10Herron)
[22:52:38] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host gitlab-runner1001.eqiad.wmnet
[22:52:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:56] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] profile: logstash: add production logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/732438 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[22:55:03] <wikibugs>	 (03CR) 10Legoktm: P::doc: sync data to non-active servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah)
[22:55:19] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] "LGTM, probably worth waiting until Monday though for the removal." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741115 (https://phabricator.wikimedia.org/T289224) (owner: 10Majavah)
[22:58:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[22:59:20] <mutante>	 blazegraph firing because it's burning
[22:59:26] <mutante>	 we might have to restart that
[22:59:50] <mutante>	 ryankemper: is this a data-reload in progress per https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly ?
[23:01:23] <mutante>	 papaul: install of wdqs ongoing?
[23:02:06] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] Add ownership annotations for more Service SRE services [puppet] - 10https://gerrit.wikimedia.org/r/738426 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff)
[23:03:51] <mutante>	 !log wcqs1001 -  sudo systemctl restart wcqs-blazegraph - after <+jinxer-wm> (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators 
[23:03:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:47] <wikibugs>	 (03PS1) 10PipelineBot: apple-search: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/741739
[23:08:01] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab-runner1001.eqiad.wmnet
[23:08:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wcqs1001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly  - https://alerts.wikimedia.org
[23:08:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:27] <mutante>	 !log mwmaint1002 - sudo /usr/bin/find /var/lib/puppet/clientbucket/ -type f -size 1M -delete  - to fix Icinga alert about large files in client bucket
[23:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:36] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2064.codfw.wmnet with OS buster
[23:11:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:41] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2064.codfw.wmnet with OS buster comp...
[23:12:48] <wikibugs>	 (03Abandoned) 10Legoktm: apple-search: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/741739 (owner: 10PipelineBot)
[23:15:00] <wikibugs>	 (03CR) 10Dzahn: site and install_server: add gitlab-runner1001 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[23:15:25] <wikibugs>	 (03PS3) 10Dzahn: site and install_server: add gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[23:18:44] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site and install_server: add gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[23:18:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:18:49] <wikibugs>	 (03PS4) 10Dzahn: site and install_server: add gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740603 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto)
[23:22:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2065.codfw.wmnet with OS buster
[23:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:35] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2065.codfw.wmnet with OS buster
[23:26:02] <mutante>	 !log ganeti - bringing up new VM - sudo gnt-instance start gitlab-runner1001.eqiad.wmnet ; ran puppet on install1003; installing OS T295481
[23:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:07] <stashbot>	 T295481: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481
[23:28:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH)
[23:32:42] <wikibugs>	 (03PS2) 10Dzahn: site: use gitlab_runner role on gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740691 (https://phabricator.wikimedia.org/T295481)
[23:34:03] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:43:25] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:44:19] <mutante>	 !log [puppetmaster1001:~] $ sudo puppet cert sign gitlab-runner1001.eqiad.wmnet |  sudo install_console gitlab-runner1001.eqiad.wmnet (T295481)
[23:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:23] <stashbot>	 T295481: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481
[23:44:25] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:52:21] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2065.codfw.wmnet with OS buster
[23:52:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:52:25] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2065.codfw.wmnet with OS buster comp...
[23:52:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:53:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:57:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH)
[23:58:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH)
[23:58:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:58:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10RobH)
[23:59:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2066.codfw.wmnet with OS buster
[23:59:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:29] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2066.codfw.wmnet with OS buster