[00:01:13] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:01:34] <wikibugs>	 (PS1) Cwhite: logstash: ship scap.announce channel to loki [puppet] - https://gerrit.wikimedia.org/r/804484 (https://phabricator.wikimedia.org/T222826)
[00:03:04] <wikibugs>	 (PS3) Legoktm: Remove references to the 'electron' service [mediawiki-config] - https://gerrit.wikimedia.org/r/634935 (owner: Giuseppe Lavagetto)
[00:06:29] <icinga-wm>	 PROBLEM - Check systemd state on miscweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:09] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:12:25] <wikibugs>	 (PS7) Eevans: Configure AQS Cassandra hosts (codfw) [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)
[00:18:28] <wikibugs>	 (PS8) Eevans: Configure AQS Cassandra hosts (codfw) [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)
[00:20:35] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:10] <wikibugs>	 (CR) Krinkle: "See also T175146 and T243096. I suspect, but can't be sure, that this RPC endpoint is no longer in use. CP-JobQueue now uses RunSingleJob " [mediawiki-config] - https://gerrit.wikimedia.org/r/793837 (owner: D3r1ck01)
[00:23:33] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:41] <wikibugs>	 (CR) Eevans: "PCC output: https://puppet-compiler.wmflabs.org/pcc-worker1003/35808/" [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: Eevans)
[00:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:31:31] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:47] <icinga-wm>	 RECOVERY - Maps - OSM synchronization lag - codfw on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.694e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=12
[01:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:10:13] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:25:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:36:55] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:18:11] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:20:23] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:20:27] <icinga-wm>	 PROBLEM - Disk space on kafka-test1008 is CRITICAL: DISK CRITICAL - free space: / 3675 MB (3% inode=98%): /tmp 3675 MB (3% inode=98%): /var/tmp 3675 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1008&var-datasource=eqiad+prometheus/ops
[02:29:39] <icinga-wm>	 PROBLEM - Disk space on kafka-test1009 is CRITICAL: DISK CRITICAL - free space: / 3100 MB (3% inode=98%): /tmp 3100 MB (3% inode=98%): /var/tmp 3100 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1009&var-datasource=eqiad+prometheus/ops
[02:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:38:17] <icinga-wm>	 PROBLEM - Disk space on kafka-test1007 is CRITICAL: DISK CRITICAL - free space: / 2444 MB (2% inode=98%): /tmp 2444 MB (2% inode=98%): /var/tmp 2444 MB (2% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1007&var-datasource=eqiad+prometheus/ops
[03:07:43] <icinga-wm>	 PROBLEM - Kafka Broker Server on kafka-test1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[03:07:55] <icinga-wm>	 PROBLEM - Check systemd state on kafka-test1008 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:05] <icinga-wm>	 PROBLEM - Kafka Broker Server on kafka-test1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[03:08:09] <icinga-wm>	 PROBLEM - Check systemd state on kafka-test1009 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:09] <icinga-wm>	 PROBLEM - Check systemd state on kafka-test1007 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:49] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[03:08:59] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[03:09:01] <icinga-wm>	 PROBLEM - Kafka Broker Server on kafka-test1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[03:09:07] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-test1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[03:16:37] <wikibugs>	 (PS1) Tim Starling: make_beta_config.py: run helm as helm3 [deployment-charts] - https://gerrit.wikimedia.org/r/804486 (https://phabricator.wikimedia.org/T295578)
[03:19:35] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:20:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:22:39] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-test1006 is CRITICAL: 26 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1006
[03:23:17] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-test1010 is CRITICAL: 42 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1010
[03:23:43] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:53:41] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[04:01:51] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:02:41] <icinga-wm>	 RECOVERY - Kafka Broker Server on kafka-test1009 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[04:09:09] <icinga-wm>	 PROBLEM - Kafka Broker Server on kafka-test1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[04:26:09] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:30:17] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (webperf1004, ...), Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:43:17] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:50:57] <icinga-wm>	 RECOVERY - Kafka Broker Server on kafka-test1007 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[04:57:23] <icinga-wm>	 PROBLEM - Kafka Broker Server on kafka-test1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[05:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:06:55] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:25:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:27:09] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:41:33] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:44:19] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:45:41] <wikibugs>	 ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (phaultfinder)
[05:46:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298560)', diff saved to https://phabricator.wikimedia.org/P29608 and previous config saved to /var/cache/conftool/dbconfig/20220610-054603-ladsgroup.json
[05:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:11] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[05:53:49] <icinga-wm>	 RECOVERY - Kafka Broker Server on kafka-test1008 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[06:00:17] <icinga-wm>	 PROBLEM - Kafka Broker Server on kafka-test1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[06:01:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P29609 and previous config saved to /var/cache/conftool/dbconfig/20220610-060108-ladsgroup.json
[06:01:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:46] <wikibugs>	 (PS5) Ayounsi: Add python3.10 support to Tox [cookbooks] - https://gerrit.wikimedia.org/r/803263
[06:01:48] <wikibugs>	 (PS12) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[06:04:38] <wikibugs>	 (CR) CI reject: [V: -1] Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[06:09:11] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:09:35] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:53] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:10:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:15:40] <wikibugs>	 (PS6) Ayounsi: Add python3.10 support to Tox [cookbooks] - https://gerrit.wikimedia.org/r/803263
[06:15:42] <wikibugs>	 (PS13) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[06:16:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P29610 and previous config saved to /var/cache/conftool/dbconfig/20220610-061613-ladsgroup.json
[06:16:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:47] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:18:40] <wikibugs>	 (CR) CI reject: [V: -1] Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[06:20:52] <wikibugs>	 (PS14) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[06:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:31:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298560)', diff saved to https://phabricator.wikimedia.org/P29611 and previous config saved to /var/cache/conftool/dbconfig/20220610-063119-ladsgroup.json
[06:31:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[06:31:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[06:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:25] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[06:31:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298560)', diff saved to https://phabricator.wikimedia.org/P29612 and previous config saved to /var/cache/conftool/dbconfig/20220610-063127-ladsgroup.json
[06:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:37] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:36:00] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804477 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:36:05] <wikibugs>	 (PS2) Muehlenhoff: cpufrequtils: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804477 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:42:50] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804465 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:45:13] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804468 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:45:19] <wikibugs>	 (PS2) Muehlenhoff: external_proxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804468 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:50:41] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804469 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:50:48] <wikibugs>	 (PS2) Muehlenhoff: external_clouds_vendors: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804469 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:55:50] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804473 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:55:55] <wikibugs>	 (PS2) Muehlenhoff: dumpsuser: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804473 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220610T0700)
[07:03:49] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat
[07:03:49] <icinga-wm>	 ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[07:03:49] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM overall, modulo dashboard link" [alerts] - https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[07:04:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Yep this LGTM! Thanks Daniel" [puppet] - https://gerrit.wikimedia.org/r/804416 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[07:05:13] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804475 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:05:15] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:05:20] <wikibugs>	 (PS2) Muehlenhoff: docker_pusher: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804475 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:08:11] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, modulo dashboard link" [alerts] - https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[07:08:46] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804476 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:08:52] <wikibugs>	 (PS2) Muehlenhoff: docker_pkg: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804476 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:09:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on webperf2002.codfw.wmnet,webperf1002.eqiad.wmnet with reason: Pending decom, new Bullseye nodes in place
[07:10:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:10:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on webperf2002.codfw.wmnet,webperf1002.eqiad.wmnet with reason: Pending decom, new Bullseye nodes in place
[07:10:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:23] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804478 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:11:29] <wikibugs>	 (PS2) Muehlenhoff: conntrackd: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804478 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:13:31] <wikibugs>	 (PS2) Zabe: docker_registry_ha: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804474 (https://phabricator.wikimedia.org/T308013)
[07:13:54] <wikibugs>	 (CR) Zabe: docker_registry_ha: Add SPDX headers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/804474 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:29:30] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you for the feedback!" [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[07:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:30:53] <wikibugs>	 SRE, SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (Gehel) Just to confirm: `analytics-privatedata-users` should be all that is required for @bscarone
[07:38:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1065.eqiad.wmnet with OS bullseye
[07:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:55] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1065.eqiad.wmnet with OS bullseye
[07:40:44] <wikibugs>	 (PS1) Muehlenhoff: xenon: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/804546
[07:43:18] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/804546 (owner: Muehlenhoff)
[07:49:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:53] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:56:30] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1065.eqiad.wmnet with reason: host reimage
[07:56:31] <wikibugs>	 SRE, Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (fgiunchedi) I took a look at the puppet master at `pontoon.traffic.eqiad1.wikimedia.cloud` and got puppet to run, however now a self-signed error is sho...
[07:56:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:39] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1065.eqiad.wmnet with reason: host reimage
[07:59:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:47] <wikibugs>	 (CR) DCausse: [C: +1] Add a check that deb is unreleased in prepare_commit [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/804004 (owner: Ebernhardson)
[08:27:55] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1065.eqiad.wmnet with OS bullseye
[08:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:59] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1065.eqiad.wmnet with OS bullseye completed: - ms-be1065 (**PASS**)   - Downtim...
[08:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:57:47] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade.
[08:57:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:02] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1066.eqiad.wmnet with OS bullseye
[09:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:07] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1066.eqiad.wmnet with OS bullseye
[09:02:30] <icinga-wm>	 RECOVERY - Kafka Broker Server on kafka-test1009 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[09:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:07:14] <icinga-wm>	 PROBLEM - Kafka Broker Server on kafka-test1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[09:08:07] <wikibugs>	 (03PS1) 10Btullis: Increase the JVM heap for the Hadoop namenode servers [puppet] - 10https://gerrit.wikimedia.org/r/804551 (https://phabricator.wikimedia.org/T310293)
[09:08:51] <wikibugs>	 (03CR) 10JMeybohm: "I'd say we do as we did in prod and just uninstall the "helm" package from deployment hosts to have alternatives pick up helm 3 as default" [deployment-charts] - 10https://gerrit.wikimedia.org/r/804486 (https://phabricator.wikimedia.org/T295578) (owner: 10Tim Starling)
[09:09:36] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35811/console" [puppet] - 10https://gerrit.wikimedia.org/r/804551 (https://phabricator.wikimedia.org/T310293) (owner: 10Btullis)
[09:19:40] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1066.eqiad.wmnet with reason: host reimage
[09:19:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3002.esams.wmnet
[09:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:11] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1066.eqiad.wmnet with reason: host reimage
[09:24:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3002.esams.wmnet
[09:30:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:32:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3002.esams.wmnet to ganeti01.svc.esams.wmnet
[09:32:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:23] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35812/console" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[09:33:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3002.esams.wmnet to ganeti01.svc.esams.wmnet
[09:33:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10matthiasmullie)
[09:35:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10matthiasmullie)
[09:36:54] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[09:38:27] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1066.eqiad.wmnet with OS bullseye
[09:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:31] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1066.eqiad.wmnet with OS bullseye completed: - ms-be1066 (**PASS**)   - Downtim...
[09:39:36] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Matthias Mullie to contributors [puppet] - 10https://gerrit.wikimedia.org/r/804553
[09:40:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Matthias Mullie to contributors [puppet] - 10https://gerrit.wikimedia.org/r/804553 (owner: 10Muehlenhoff)
[09:44:33] <wikibugs>	 (03CR) 10JMeybohm: "I've no idea about the difference between "Systemd::Service[]" and "Service[]", but if "Systemd::Service[]" is not the right thing to noti" [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori)
[09:47:58] <wikibugs>	 (03CR) 10Muehlenhoff: service::docker: refresh service when config file is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori)
[09:50:32] <wikibugs>	 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10ayounsi) Thanks for the quick and thorough answer! Glad to see that there is progress upstream!  > exceptional nature of having to add new nodes It's not just this, but also the long term cost of...
[09:50:41] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10phaultfinder)
[09:51:44] <wikibugs>	 (03CR) 10JMeybohm: "> This commit contains the unmodified boilerplate files as generated by" [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori)
[09:56:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10MoritzMuehlenhoff)
[09:59:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:20] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch idp1001/idp2001 to role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214)
[10:04:40] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:24] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:14:40] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1067.eqiad.wmnet with OS bullseye
[10:14:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:44] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1067.eqiad.wmnet with OS bullseye
[10:19:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:23:08] <wikibugs>	 (03CR) 10Jbond: Make SREBatchBase operate on host groups (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm)
[10:27:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) No problem @nskaggs I'm off today but I can put some more verbose instructions together next week and link t...
[10:30:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] service::docker: refresh service when config file is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori)
[10:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:32:35] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1067.eqiad.wmnet with reason: host reimage
[10:32:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:50] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1067.eqiad.wmnet with reason: host reimage
[10:35:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "While this will work, it's not backwards compatible. Older versions of the cxserver image won't be able to be deployed with this change as" [deployment-charts] - 10https://gerrit.wikimedia.org/r/804256 (owner: 10KartikMistry)
[10:46:08] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "Thanks Ben :)" [puppet] - 10https://gerrit.wikimedia.org/r/804551 (https://phabricator.wikimedia.org/T310293) (owner: 10Btullis)
[10:49:08] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:53:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:54:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:56:39] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade.
[10:56:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:11] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1067.eqiad.wmnet with OS bullseye
[10:59:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:15] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1067.eqiad.wmnet with OS bullseye completed: - ms-be1067 (**PASS**)   - Downtim...
[11:01:17] <wikibugs>	 (03PS4) 10Jbond: puppetmaster: update private repo pre-commit to error un-staged [puppet] - 10https://gerrit.wikimedia.org/r/803560
[11:01:20] <wikibugs>	 (03CR) 10Jbond: puppetmaster: update private repo pre-commit to error un-staged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803560 (owner: 10Jbond)
[11:02:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Add page metadata to Wikibase JSON dumps [puppet] - 10https://gerrit.wikimedia.org/r/802921 (https://phabricator.wikimedia.org/T301104) (owner: 10Mitar)
[11:08:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/804266 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[11:09:35] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:13:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm)
[11:13:36] <wikibugs>	 (03CR) 10KartikMistry: Update nodejs -> node command (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/804256 (owner: 10KartikMistry)
[11:17:29] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:23:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] eventschemas: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804470 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[11:23:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] etcdmirror: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804471 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[11:24:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] envoyproxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804472 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[11:24:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] galera: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804466 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[11:24:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] fifo_log_demux: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804467 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[11:34:15] <wikibugs>	 (03CR) 10Jelto: "that looks mostly good to me." [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall)
[11:35:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:54:37] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:00:32] <wikibugs>	 (03PS9) 10Krinkle: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling)
[12:00:48] <wikibugs>	 (03PS10) 10Krinkle: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling)
[12:00:59] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling)
[12:01:03] <wikibugs>	 (03PS8) 10Krinkle: Clean up scap sequencing workaround [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 (owner: 10Tim Starling)
[12:01:18] <wikibugs>	 (03PS9) 10Krinkle: Clean up scap sequencing workaround for I0cd5dbeab0e6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 (owner: 10Tim Starling)
[12:01:22] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Clean up scap sequencing workaround for I0cd5dbeab0e6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 (owner: 10Tim Starling)
[12:06:31] <icinga-wm>	 RECOVERY - Kafka Broker Server on kafka-test1007 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[12:10:02] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Looks good to me. 👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802947 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712)
[12:11:10] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001"
[12:11:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:17] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001"
[12:11:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:24] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): "This is a work-in-progress, isn't it? I mean, what's the plan with these TODOs?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802946 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712)
[12:12:13] <icinga-wm>	 RECOVERY - Check systemd state on kafka-test1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:12:15] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802842 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712)
[12:12:33] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1007 is OK: SSL OK - Certificate kafka-test1007.eqiad.wmnet valid until 2023-01-24 11:32:00 +0000 (expires in 227 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[12:12:37] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] phpcs: move Misleading$wgDebugLogFile exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802840 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712)
[12:13:11] <icinga-wm>	 RECOVERY - Kafka Broker Server on kafka-test1008 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[12:13:39] <icinga-wm>	 RECOVERY - Check systemd state on kafka-test1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:14:21] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1008 is OK: SSL OK - Certificate kafka-test1008.eqiad.wmnet valid until 2023-01-24 11:32:00 +0000 (expires in 227 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[12:16:09] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-test1006 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1006
[12:16:32] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): phpcs: move AssignmentInControlStructures exclusion inline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796360 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712)
[12:17:25] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-test1010 is CRITICAL: 16 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1010
[12:18:07] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-test1009 is OK: SSL OK - Certificate kafka-test1009.eqiad.wmnet valid until 2023-01-24 11:31:00 +0000 (expires in 227 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[12:18:49] <icinga-wm>	 RECOVERY - Kafka Broker Server on kafka-test1009 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[12:19:41] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-test1010 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1010
[12:19:43] <icinga-wm>	 RECOVERY - Check systemd state on kafka-test1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:29] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:23:55] <icinga-wm>	 RECOVERY - Disk space on kafka-test1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1007&var-datasource=eqiad+prometheus/ops
[12:28:23] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:28:31] <icinga-wm>	 RECOVERY - Disk space on kafka-test1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1008&var-datasource=eqiad+prometheus/ops
[12:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:30:56] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1042.eqiad.wmnet
[12:30:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:19] <jinxer-wm>	 (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:32:37] <wikibugs>	 (03PS1) 10Hashar: deployment-prep: add keyholder agent for scap [puppet] - 10https://gerrit.wikimedia.org/r/804568 (https://phabricator.wikimedia.org/T310354)
[12:36:18] <jinxer-wm>	 (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:36:23] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 674 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:36:47] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1042.eqiad.wmnet
[12:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:43] <icinga-wm>	 RECOVERY - Disk space on kafka-test1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1009&var-datasource=eqiad+prometheus/ops
[12:38:37] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:43:33] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Increase the JVM heap for the Hadoop namenode servers [puppet] - 10https://gerrit.wikimedia.org/r/804551 (https://phabricator.wikimedia.org/T310293) (owner: 10Btullis)
[12:44:28] <wikibugs>	 (03PS1) 10Jbond: scap: update venv to use the system ca bundle [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572
[12:45:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, this looks like a leftover when role::wdqs::autodeploy was removed." [puppet] - 10https://gerrit.wikimedia.org/r/803393 (owner: 10Slyngshede)
[12:46:10] <wikibugs>	 (03CR) 10Jbond: scap: update venv to use the system ca bundle (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond)
[12:46:28] <wikibugs>	 (03CR) 10Jbond: "FYI I have manually applied this change to netbox1002" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond)
[12:46:46] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001"
[12:46:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:16] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync data - jbond@cumin1001"
[12:47:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:30] <wikibugs>	 10SRE, 10Keyholder: After arming a new key in keyholder, the identity file path does not show up - https://phabricator.wikimedia.org/T257329 (10hashar) A few years later the comment showing up instead of the file has hit me T310354#7994473  The fix above is to set the key comment to use the path using `ssh-key...
[12:51:47] <wikibugs>	 (03PS1) 10Btullis: Decrease the retention time on the kafka-test cluster to 1 day [puppet] - 10https://gerrit.wikimedia.org/r/804573 (https://phabricator.wikimedia.org/T310342)
[12:55:12] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Increase the JVM heap for the Hadoop namenode servers [puppet] - 10https://gerrit.wikimedia.org/r/804551 (https://phabricator.wikimedia.org/T310293) (owner: 10Btullis)
[12:58:00] <wikibugs>	 (03PS2) 10Samtar: Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802685 (https://phabricator.wikimedia.org/T307725) (owner: 10Samwilson)
[13:02:14] <wikibugs>	 (03PS1) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575
[13:04:07] <wikibugs>	 (03CR) 10Jforrester: Use a service locator to get a job runner (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793837 (owner: 10D3r1ck01)
[13:04:27] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/804568 (https://phabricator.wikimedia.org/T310354) (owner: 10Hashar)
[13:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:09:56] <hashar>	 kostajh: I am finally reaching your docker image change https://gerrit.wikimedia.org/r/c/integration/config/+/803487 :)
[13:11:23] <kostajh>	 hashar: that probably needs to wait for the php unit entry point patch to be merged again
[13:11:47] <hashar>	 ah yeah probably
[13:12:40] <hashar>	 or we make the coverage shell script detect whether tests/phpunit/phpunit.php is present
[13:12:41] <hashar>	 hmm
[13:13:37] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Decrease the retention time on the kafka-test cluster to 1 day [puppet] - 10https://gerrit.wikimedia.org/r/804573 (https://phabricator.wikimedia.org/T310342) (owner: 10Btullis)
[13:14:43] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:26:28] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Decrease the retention time on the kafka-test cluster to 1 day [puppet] - 10https://gerrit.wikimedia.org/r/804573 (https://phabricator.wikimedia.org/T310342) (owner: 10Btullis)
[13:32:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] phabricator: add blackbox http check [puppet] - 10https://gerrit.wikimedia.org/r/804266 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[13:35:59] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:50:31] <wikibugs>	 (03PS1) 10Krinkle: Profiler: Fix reporting of Redis timeout error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804584
[13:51:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job probes/custom in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:55:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] netops: add PingUnreachable alert [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[13:56:17] <icinga-wm>	 PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:56:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job probes/custom in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:57:15] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job probes/custom in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:00:57] <icinga-wm>	 RECOVERY - Host sretest1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[14:02:00] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job probes/custom in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:07:39] <icinga-wm>	 PROBLEM - SSH on sretest1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:09:41] <icinga-wm>	 RECOVERY - SSH on sretest1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:14:00] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] Profiler: Fix reporting of Redis timeout error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804584 (owner: 10Krinkle)
[14:15:07] <wikibugs>	 (03Merged) 10jenkins-bot: Profiler: Fix reporting of Redis timeout error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804584 (owner: 10Krinkle)
[14:20:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:21:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:45] <wikibugs>	 10SRE, 10VPS-project-Codesearch, 10observability: add operations/alerts.git to hound codesearch.wmcloud.org - https://phabricator.wikimedia.org/T310364 (10CDanis)
[14:25:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:25:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:45] <wikibugs>	 10SRE, 10Librarization, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 4 others: Split GeoIP into a new component - https://phabricator.wikimedia.org/T102848 (10Krinkle)
[14:31:18] <wikibugs>	 (03PS1) 10Btullis: Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T310359)
[14:33:38] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35817/console" [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T310359) (owner: 10Btullis)
[14:35:16] <wikibugs>	 (03PS2) 10Btullis: Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T310359)
[14:35:33] <logmsgbot>	 !log krinkle@deploy1002 Synchronized src/Profiler.php: (no justification provided) (duration: 03m 43s)
[14:35:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:09] <wikibugs>	 (03CR) 10Btullis: "My PCC run didn't bring about any change on alert1001 but maybe that is because these are virtual resources?" [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T310359) (owner: 10Btullis)
[14:40:33] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:41:43] <wikibugs>	 10SRE, 10VPS-project-Codesearch, 10observability: add operations/alerts.git to hound codesearch.wmcloud.org - https://phabricator.wikimedia.org/T310364 (10Volans) Duplicate of T306881 ?
[14:42:44] <wikibugs>	 (03Abandoned) 10Jforrester: Partial revert "TextHandler::getTextTracksFromRows(): Remove unused code" [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802952 (https://phabricator.wikimedia.org/T309873) (owner: 10Jforrester)
[14:46:25] <wikibugs>	 (03Abandoned) 10Aqu: Reference the pid file used by the scheduler.service [puppet] - 10https://gerrit.wikimedia.org/r/803396 (https://phabricator.wikimedia.org/T310042) (owner: 10Aqu)
[14:49:30] <wikibugs>	 (03PS1) 10Andrew Bogott: heat: use the internal keystone port for keystone_authtoken config [puppet] - 10https://gerrit.wikimedia.org/r/804595
[14:50:39] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected v
[14:50:39] <icinga-wm>	 path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[14:52:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] heat: use the internal keystone port for keystone_authtoken config [puppet] - 10https://gerrit.wikimedia.org/r/804595 (owner: 10Andrew Bogott)
[14:54:01] <wikibugs>	 (03CR) 10Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns)
[14:54:17] <icinga-wm>	 PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:27] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:56:01] <wikibugs>	 (03CR) 10Krinkle: docroot: Improve design of noc.wikimedia.org (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800680 (owner: 10Ladsgroup)
[14:56:08] <TheresNoTime>	 couple of 503s on test.wiki/en.wiki, intermittent
[14:56:11] <icinga-wm>	 RECOVERY - Host sretest1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[14:56:19] <Tamzin>	 Hello 503 my old friend
[14:56:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew) @papaul, do you have interest in working on this more or should I take back the task? I'm thinking we should probably cu...
[14:56:59] <wikibugs>	 10SRE, 10VPS-project-Codesearch, 10observability: add operations/alerts.git to hound codesearch.wmcloud.org - https://phabricator.wikimedia.org/T310364 (10CDanis)
[14:57:01] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[14:57:19] <jinxer-wm>	 (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:57:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[14:57:41] * Emperor here
[14:57:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:58:01] <godog>	 checking too
[14:58:01] <jinxer-wm>	 (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[14:58:06] <herron>	 hey
[14:58:35] <RhinosF1>	 Things are fine here if that helps
[14:59:08] <wikibugs>	 (03PS3) 10Btullis: Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649)
[15:00:04] <jayme>	 o/
[15:00:40] <Emperor>	 [discussion in the other place]
[15:00:42] <godog>	 we're in _security 
[15:01:57] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat
[15:01:57] <icinga-wm>	 ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[15:02:19] <jinxer-wm>	 (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:02:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[15:03:01] <jinxer-wm>	 (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[15:05:44] <wikibugs>	 (03PS3) 10Ori: service::docker: refresh service when config file is changed [puppet] - 10https://gerrit.wikimedia.org/r/799420
[15:06:26] <wikibugs>	 (03CR) 10Ori: service::docker: refresh service when config file is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori)
[15:06:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[15:16:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[15:22:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:23:15] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1043.eqiad.wmnet
[15:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:05] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:26:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[15:27:43] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 30.22 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:28:35] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1043.eqiad.wmnet
[15:28:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:37] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 20.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:29:41] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 33.43 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:31:16] <wikibugs>	 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10TheresNoTime) > Visit any Wikimedia project 2 minutes ago, any page //unable to reproduce currently — time machine broken//  ( **/j** )
[15:31:51] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:31:57] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 86.53 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:32:00] <RhinosF1>	 TheresNoTime: I'll let you know if I managed to fix mine
[15:32:11] <wikibugs>	 (03PS2) 10Krinkle: noc: Redesign noc.wikimedia.org after Wikimedia Design Style Guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800680 (owner: 10Ladsgroup)
[15:32:15] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 98.59 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[15:34:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Papaul) @andrew agree. I think the same partman recipe can do it by just removing the section below ` # setup the SDB disk with...
[15:36:17] <wikibugs>	 (03PS17) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[15:38:30] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] mediawiki: disable revalidation everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto)
[15:38:41] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Scap, 10serviceops, 10Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (10dancy) 05Open→03Resolved a:03dancy
[15:44:39] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestCl
[15:44:39] <icinga-wm>	 apt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buf
[15:44:39] <icinga-wm>	 a: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[15:46:40] <wikibugs>	 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10AlexisJazz) >>! In T310368#7995017, @TheresNoTime wrote: >> Visit any Wikimedia project 2 minutes ago, any page > //unable to reproduce currently — time machine broken//  ( **/j** )  I tried to repo...
[15:47:19] <wikibugs>	 (03PS3) 10Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973
[15:47:53] <icinga-wm>	 PROBLEM - SSH on sretest1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:48:12] <wikibugs>	 (03CR) 10Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns)
[15:48:12] <wikibugs>	 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi There was indeed a brief moment of unavailability (retroactively-posted incident at https://www.wikimediastatus.net/incidents/5k90l09x2p6k)  I'm op...
[15:48:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns)
[15:49:59] <icinga-wm>	 RECOVERY - SSH on sretest1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:50:17] <wikibugs>	 (03CR) 10Dduvall: Provide buildkitd to GitLab runners (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall)
[15:51:01] <wikibugs>	 (03PS4) 10Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973
[15:58:13] <wikibugs>	 (03PS3) 10Samtar: crhwiki: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431)
[15:58:14] <wikibugs>	 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10AlexisJazz) >>! In T310368#7995052, @fgiunchedi wrote: > There was indeed a brief moment of unavailability (retroactively-posted incident at https://www.wikimediastatus.net/incidents/5k90l09x2p6k) >...
[15:58:22] <wikibugs>	 (03PS3) 10Samtar: ugwiki: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431)
[16:00:13] <wikibugs>	 (03PS4) 10Btullis: Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649)
[16:07:41] <wikibugs>	 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10fgiunchedi) >>! In T310368#7995061, @AlexisJazz wrote: >>>! In T310368#7995052, @fgiunchedi wrote: >> There was indeed a brief moment of unavailability (retroactively-posted incident at https://www....
[16:25:53] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat
[16:25:53] <icinga-wm>	 ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[16:28:03] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) is CRITICAL: Test Translate enwiki protected page returned the unexpected status 404 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with
[16:28:03] <icinga-wm>	 ted value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[16:30:38] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:35:39] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:37:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:52] <wikibugs>	 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10AlexisJazz) >>! In T310368#7995101, @CDanis wrote: >>>! In T310368#7995061, @AlexisJazz wrote: >> That just says "From 14:55 to 15:01 UTC users have been experiencing slow/unavailable access to Wiki...
[16:46:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:48:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Jclark-ctr) main board swapped. Tech just left
[16:52:25] <wikibugs>	 (03PS5) 10Ottomata: airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns)
[16:54:40] <wikibugs>	 (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/pcc-worker1002/35820/" [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns)
[16:55:23] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns)
[16:58:11] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:03:19] <wikibugs>	 (03PS5) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723)
[17:03:21] <wikibugs>	 (03PS2) 10BCornwall: Traffic Add alert for Varnish child restart [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723)
[17:03:46] <wikibugs>	 (03PS6) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723)
[17:03:48] <wikibugs>	 (03PS3) 10BCornwall: Traffic Add alert for Varnish child restart [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723)
[17:04:33] <wikibugs>	 (03CR) 10BCornwall: Traffic Add alert for Varnish child restart (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall)
[17:04:55] <wikibugs>	 (03CR) 10BCornwall: "Thanks for that. Duh!" [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall)
[17:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:05:45] <wikibugs>	 (03PS4) 10BCornwall: Traffic: Add alert for Varnish child restart [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723)
[17:09:48] <Mitar>	 jbond here?
[17:13:27] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis)
[17:19:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "alright, thanks Filippo" [puppet] - 10https://gerrit.wikimedia.org/r/804416 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:23:24] <wikibugs>	 10SRE, 10Znuny, 10serviceops, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Dzahn) @Arnoldokoth The last change we had uploaded in our meeting the other day is now merged. I would say we can call this resolved and close the ticket (but also creat...
[17:29:42] <wikibugs>	 (03PS6) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271)
[17:32:23] <wikibugs>	 (03PS7) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271)
[17:33:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall)
[17:34:52] <wikibugs>	 (03PS3) 10Ssingh: DHCP: make doh and durum hosts use the bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) (owner: 10Dzahn)
[17:35:51] <wikibugs>	 (03CR) 10Dduvall: Provide buildkitd to GitLab runners (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall)
[17:36:18] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] DHCP: make doh and durum hosts use the bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) (owner: 10Dzahn)
[17:36:26] <mutante>	 :)
[17:36:31] <sukhe>	 :D
[17:36:56] <wikibugs>	 (03PS8) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271)
[17:37:25] <wikibugs>	 (03PS9) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271)
[17:37:47] <wikibugs>	 (03CR) 10Zabe: vrts: rename cleanup cache service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[17:41:23] <wikibugs>	 (03PS10) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271)
[17:41:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] vrts: rename cleanup cache service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[17:42:15] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] vrts: rename cleanup cache service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[17:42:30] <wikibugs>	 (03CR) 10Dduvall: Provide buildkitd to GitLab runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall)
[17:46:13] <wikibugs>	 (03CR) 10Dduvall: "Thanks for the review, Jelto. I think I've addressed all of your comments. I do agree that the use of a specific docket network adds a " [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall)
[17:55:09] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:05:49] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:11:32] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Eevans) Is https://phabricator.wikimedia.org/T305568#7992483 relevant here as well?  TL;DR were these provisioned as Cassandra hosts with two additional IPs(/DNS) f...
[18:19:36] <wikibugs>	 (03PS7) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811
[18:20:57] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:21:44] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Xcollazo - https://phabricator.wikimedia.org/T310385 (10XCollazo-WMF)
[18:32:55] <icinga-wm>	 PROBLEM - Host cp1089 is DOWN: PING CRITICAL - Packet loss = 100%
[18:33:44] <sukhe>	 oh?
[18:34:20] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10RLazarus)
[18:35:40] <sukhe>	 sigh, memory issues
[18:35:59] <sukhe>	 ^ filing a task for cp1089 and downtiming it for now
[18:37:02] <rzl>	 sounds like a....... dimm situation??
[18:37:04] <rzl>	 thanks sukhe 
[18:37:08] <sukhe>	 yep
[18:37:08] <sukhe>	 Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
[18:38:34] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ms-be2066 - https://phabricator.wikimedia.org/T309595 (10wiki_willy) a:03Papaul
[18:39:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1089 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T310387 (10ssingh)
[18:40:04] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp1089.eqiad.wmnet with reason: downtimed because of DIMM replacement: T310387
[18:40:07] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp1089.eqiad.wmnet with reason: downtimed because of DIMM replacement: T310387
[18:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:11] <stashbot>	 T310387: cp1089 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T310387
[18:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:06] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1089.eqiad.wmnet,service=ats-be
[18:42:06] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1089.eqiad.wmnet,service=varnish-fe
[18:42:06] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1089.eqiad.wmnet,service=ats-tls
[18:42:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:03] <wikibugs>	 (03CR) 10JMeybohm: "This change is ready for review." (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[19:02:01] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:25:15] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:29:26] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet
[19:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:23] <wikibugs>	 (PS1) Andrew Bogott: Partman: give up on a two-hwraid configure and just configure the first drive. [puppet] - https://gerrit.wikimedia.org/r/804633 (https://phabricator.wikimedia.org/T302981)
[19:34:31] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Partman: give up on a two-hwraid configure and just configure the first drive. [puppet] - https://gerrit.wikimedia.org/r/804633 (https://phabricator.wikimedia.org/T302981) (owner: Andrew Bogott)
[19:35:18] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet
[19:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:56] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[19:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:10] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:58:05] <RhinosF1>	 ^ is expired downtime. I've let b.tulis know
[20:00:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:14:26] <wikibugs>	 (PS1) CDanis: only page for NEL after 5 minutes [alerts] - https://gerrit.wikimedia.org/r/804640
[20:18:40] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat
[20:18:40] <icinga-wm>	 ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[20:20:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:23:00] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat
[20:23:00] <icinga-wm>	 ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[20:23:24] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:25:26] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye
[20:25:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:36] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[20:25:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:06] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:30:58] <wikibugs>	 SRE, SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (bscarone) @CDanis once I am on `bast1003.wikimedia.org` and ssh `stat1005.eqiad.wmnet` or `stat1008.eqiad.wmnet` I am prompted to enter a password, so I...
[20:31:19] <wikibugs>	 SRE, SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (bscarone) Resolved→Open
[20:32:48] <wikibugs>	 SRE, SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (CDanis) >>! In T310021#7995684, @bscarone wrote: > @CDanis once I am on `bast1003.wikimedia.org` and ssh `stat1005.eqiad.wmnet` or `stat1008.eqiad.wmnet`...
[20:35:09] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (dancy)
[20:35:20] <wikibugs>	 (PS1) Andrew Bogott: hwraid-2dev.cfg: etc [puppet] - https://gerrit.wikimedia.org/r/804644
[20:36:34] <wikibugs>	 (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg: etc [puppet] - https://gerrit.wikimedia.org/r/804644 (owner: Andrew Bogott)
[20:36:57] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye
[20:36:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:24] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[20:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:48] <wikibugs>	 SRE, SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (bscarone) Oh, I see, that was the issue. Now I managed to do it, thank you, I will close the task.
[20:40:51] <wikibugs>	 (PS11) Dduvall: Provide buildkitd to GitLab runners [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271)
[20:41:06] <wikibugs>	 SRE, SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (bscarone) Open→Resolved
[20:52:17] <wikibugs>	 (PS1) Andrew Bogott: partman/hwraid-2dev.cfg: yet more etc [puppet] - https://gerrit.wikimedia.org/r/804645
[20:53:52] <wikibugs>	 (CR) Andrew Bogott: [C: +2] partman/hwraid-2dev.cfg: yet more etc [puppet] - https://gerrit.wikimedia.org/r/804645 (owner: Andrew Bogott)
[20:54:28] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye
[20:54:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:49] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[20:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:40] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:20:46] <wikibugs>	 (CR) Krinkle: [C: +1] Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling)
[21:23:30] <wikibugs>	 (PS3) Krinkle: mediawiki: disable revalidation for api,app,parsoid clusters [puppet] - https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[21:25:03] <wikibugs>	 (PS4) Krinkle: mediawiki: disable revalidation for api,app,parsoid clusters [puppet] - https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[21:25:10] <wikibugs>	 (CR) Krinkle: [C: +1] mediawiki: disable revalidation for api,app,parsoid clusters [puppet] - https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[21:27:46] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:32:20] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35822/" [puppet] - https://gerrit.wikimedia.org/r/778243 (owner: Dzahn)
[21:37:54] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:39:50] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough daniel_zahn RhinosF1 already pinged btullis https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:41:26] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[21:41:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:07] <RhinosF1>	 Thanks mutante
[21:42:36] <RhinosF1>	 That was deliberately broken. The raid battery has been swapped with another server as I think it's going eventually.
[21:43:04] <wikibugs>	 SRE: cloudelastic1001 through cloudelastic1006: CRITICAL - commonswiki_file - https://phabricator.wikimedia.org/T310400 (Dzahn)
[21:44:35] <wikibugs>	 SRE: cloudelastic1001 through cloudelastic1006: CRITICAL - commonswiki_file - https://phabricator.wikimedia.org/T310400 (Dzahn)
[21:44:48] <RhinosF1>	 mutante: that's a dupe task
[21:45:10] <wikibugs>	 SRE: cloudelastic1001 through cloudelastic1006: CRITICAL - unassigned shard / commonswiki_file - https://phabricator.wikimedia.org/T310400 (RhinosF1)
[21:45:16] <wikibugs>	 (CR) Krinkle: [C: -1] multiversion: Simplify code and improve documentation (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/785308 (owner: Krinkle)
[21:45:22] <wikibugs>	 SRE: cloudelastic1001 through cloudelastic1006: CRITICAL - unassigned shard / commonswiki_file - https://phabricator.wikimedia.org/T310400 (Dzahn) duplicate→Open
[21:46:01] <wikibugs>	 SRE: cloudelastic1001 through cloudelastic1006: CRITICAL - unassigned shard / commonswiki_file - https://phabricator.wikimedia.org/T310400 (Dzahn)
[21:46:40] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1001 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:40] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1002 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:40] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1003 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:40] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1004 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:40] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1005 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:41] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1006 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:49:49] <mutante>	 !log acking unhandled crit alerts on cloud dev hosts 
[21:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:29] <mutante>	 !log miscweb1002 - logrotate service was broken for unknown reasons, no recent change
[21:52:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:01] <wikibugs>	 (CR) CDanis: [C: +1] admin: add ml-team-admins to ores-admin by default [puppet] - https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044) (owner: Elukey)
[21:58:37] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (CDanis) LGTM for clinic duty
[21:59:01] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Xcollazo - https://phabricator.wikimedia.org/T310385 (CDanis) Open→Resolved a:CDanis Done!
[21:59:49] <wikibugs>	 SRE, LDAP-Access-Requests, Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (CDanis) a:KFrancis @KFrancis can you confirm an NDA on file for this researcher? Thanks!
[22:00:16] <mutante>	 !log miscweb1002 - systemctl start logrotate (it worked on second attempt, uh?, but it worked now) - systemctl reset-failed to clear icinga alerts
[22:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:34] <icinga-wm>	 RECOVERY - Check systemd state on miscweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:03:03] <mutante>	 !log mirror1001 - nginx service failed since > 1 month and unhandled alert - site is up though
[22:03:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:00] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:22] <mutante>	 !log mirror1001 - monitored nginx - package was in state "rc" and apache is running instead. systemctl reset-failed cleared alerts
[22:04:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (wiki_willy) Updated estimate, excluding the EX4200s and EX4300, is attached:  {F35226351}  The recycling pickup will be Wednesday (June 15) between 8a-12p ET and the onsite drive shredding will be later th...
[22:39:14] <wikibugs>	 (PS1) Krinkle: noc: Add a menu in the new design, add some additional links [mediawiki-config] - https://gerrit.wikimedia.org/r/804670
[22:43:06] <wikibugs>	 (PS3) Krinkle: multiversion: Simplify code and improve documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/785308
[23:08:18] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:27:24] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook