[00:01:13] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:01:34] (PS1) Cwhite: logstash: ship scap.announce channel to loki [puppet] - https://gerrit.wikimedia.org/r/804484 (https://phabricator.wikimedia.org/T222826)
[00:03:04] (PS3) Legoktm: Remove references to the 'electron' service [mediawiki-config] - https://gerrit.wikimedia.org/r/634935 (owner: Giuseppe Lavagetto)
[00:06:29] PROBLEM - Check systemd state on miscweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:09] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:12:25] (PS7) Eevans: Configure AQS Cassandra hosts (codfw) [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)
[00:18:28] (PS8) Eevans: Configure AQS Cassandra hosts (codfw) [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)
[00:20:35] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:10] (CR) Krinkle: "See also T175146 and T243096. I suspect, but can't be sure, that this RPC endpoint is no longer in use. CP-JobQueue now uses RunSingleJob " [mediawiki-config] - https://gerrit.wikimedia.org/r/793837 (owner: D3r1ck01)
[00:23:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:41] (CR) Eevans: "PCC output: https://puppet-compiler.wmflabs.org/pcc-worker1003/35808/" [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: Eevans)
[00:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:31:31] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:47] RECOVERY - Maps - OSM synchronization lag - codfw on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.694e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=12
[01:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:10:13] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:25:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:36:55] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:18:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:20:23] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:20:27] PROBLEM - Disk space on kafka-test1008 is CRITICAL: DISK CRITICAL - free space: / 3675 MB (3% inode=98%): /tmp 3675 MB (3% inode=98%): /var/tmp 3675 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1008&var-datasource=eqiad+prometheus/ops
[02:29:39] PROBLEM - Disk space on kafka-test1009 is CRITICAL: DISK CRITICAL - free space: / 3100 MB (3% inode=98%): /tmp 3100 MB (3% inode=98%): /var/tmp 3100 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1009&var-datasource=eqiad+prometheus/ops
[02:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:38:17] PROBLEM - Disk space on kafka-test1007 is CRITICAL: DISK CRITICAL - free space: / 2444 MB (2% inode=98%): /tmp 2444 MB (2% inode=98%): /var/tmp 2444 MB (2% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1007&var-datasource=eqiad+prometheus/ops
[03:07:43] PROBLEM - Kafka Broker Server on kafka-test1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[03:07:55] PROBLEM - Check systemd state on kafka-test1008 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:05] PROBLEM - Kafka Broker Server on kafka-test1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[03:08:09] PROBLEM - Check systemd state on kafka-test1009 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:09] PROBLEM - Check systemd state on kafka-test1007 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:49] PROBLEM - Kafka broker TLS certificate validity on kafka-test1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[03:08:59] PROBLEM - Kafka broker TLS certificate validity on kafka-test1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[03:09:01] PROBLEM - Kafka Broker Server on kafka-test1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[03:09:07] PROBLEM - Kafka broker TLS certificate validity on kafka-test1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[03:16:37] (PS1) Tim Starling: make_beta_config.py: run helm as helm3 [deployment-charts] - https://gerrit.wikimedia.org/r/804486 (https://phabricator.wikimedia.org/T295578)
[03:19:35] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:20:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:22:39] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-test1006 is CRITICAL: 26 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1006
[03:23:17] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-test1010 is CRITICAL: 42 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1010
[03:23:43] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:53:41] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[04:01:51] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:02:41] RECOVERY - Kafka Broker Server on kafka-test1009 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[04:09:09] PROBLEM - Kafka Broker Server on kafka-test1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[04:26:09] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:30:17] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (webperf1004, ...), Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:43:17] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:50:57] RECOVERY - Kafka Broker Server on kafka-test1007 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[04:57:23] PROBLEM - Kafka Broker Server on kafka-test1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[05:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:06:55] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:25:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:27:09] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:41:33] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:44:19] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:45:41] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (phaultfinder)
[05:46:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298560)', diff saved to https://phabricator.wikimedia.org/P29608 and previous config saved to /var/cache/conftool/dbconfig/20220610-054603-ladsgroup.json
[05:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:11] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[05:53:49] RECOVERY - Kafka Broker Server on kafka-test1008 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[06:00:17] PROBLEM - Kafka Broker Server on kafka-test1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[06:01:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P29609 and previous config saved to /var/cache/conftool/dbconfig/20220610-060108-ladsgroup.json
[06:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:46] (PS5) Ayounsi: Add python3.10 support to Tox [cookbooks] - https://gerrit.wikimedia.org/r/803263
[06:01:48] (PS12) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[06:04:38] (CR) CI reject: [V: -1] Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[06:09:11] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:09:35] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:53] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:10:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:15:40] (PS6) Ayounsi: Add python3.10 support to Tox [cookbooks] - https://gerrit.wikimedia.org/r/803263
[06:15:42] (PS13) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[06:16:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P29610 and previous config saved to /var/cache/conftool/dbconfig/20220610-061613-ladsgroup.json
[06:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:18:40] (CR) CI reject: [V: -1] Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[06:20:52] (PS14) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[06:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:31:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298560)', diff saved to https://phabricator.wikimedia.org/P29611 and previous config saved to /var/cache/conftool/dbconfig/20220610-063119-ladsgroup.json
[06:31:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[06:31:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[06:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:25] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[06:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298560)', diff saved to https://phabricator.wikimedia.org/P29612 and previous config saved to /var/cache/conftool/dbconfig/20220610-063127-ladsgroup.json
[06:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:37] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:36:00] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804477 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:36:05] (PS2) Muehlenhoff: cpufrequtils: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804477 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:42:50] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804465 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:45:13] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804468 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:45:19] (PS2) Muehlenhoff: external_proxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804468 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:50:41] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804469 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:50:48] (PS2) Muehlenhoff: external_clouds_vendors: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804469 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:55:50] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804473 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:55:55] (PS2) Muehlenhoff: dumpsuser: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804473 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220610T0700)
[07:03:49] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat
[07:03:49] ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[07:03:49] (CR) Filippo Giunchedi: "LGTM overall, modulo dashboard link" [alerts] - https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[07:04:53] (CR) Filippo Giunchedi: [C: +1] "Yep this LGTM! Thanks Daniel" [puppet] - https://gerrit.wikimedia.org/r/804416 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[07:05:13] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804475 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:05:15] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:05:20] (PS2) Muehlenhoff: docker_pusher: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804475 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:08:11] (CR) Filippo Giunchedi: "LGTM, modulo dashboard link" [alerts] - https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[07:08:46] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804476 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:08:52] (PS2) Muehlenhoff: docker_pkg: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804476 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:09:59] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on webperf2002.codfw.wmnet,webperf1002.eqiad.wmnet with reason: Pending decom, new Bullseye nodes in place
[07:10:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on webperf2002.codfw.wmnet,webperf1002.eqiad.wmnet with reason: Pending decom, new Bullseye nodes in place
[07:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:23] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/804478 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:11:29] (PS2) Muehlenhoff: conntrackd: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804478 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:13:31] (PS2) Zabe: docker_registry_ha: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804474 (https://phabricator.wikimedia.org/T308013)
[07:13:54] (CR) Zabe: docker_registry_ha: Add SPDX headers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/804474 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:29:30] (CR) Filippo Giunchedi: "Thank you for the feedback!" [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[07:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:30:53] SRE, SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (Gehel) Just to confirm: `analytics-privatedata-users` should be all that is required for @bscarone
[07:38:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1065.eqiad.wmnet with OS bullseye
[07:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:55] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1065.eqiad.wmnet with OS bullseye
[07:40:44] (PS1) Muehlenhoff: xenon: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/804546
[07:43:18] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/804546 (owner: Muehlenhoff)
[07:49:27] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:53] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:56:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1065.eqiad.wmnet with reason: host reimage
[07:56:31] SRE, Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (fgiunchedi) I took a look at the puppet master at `pontoon.traffic.eqiad1.wikimedia.cloud` and got puppet to run, however now a self-signed error is sho...
[07:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1065.eqiad.wmnet with reason: host reimage
[07:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:47] (CR) DCausse: [C: +1] Add a check that deb is unreleased in prepare_commit [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/804004 (owner: Ebernhardson)
[08:27:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1065.eqiad.wmnet with OS bullseye
[08:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:59] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1065.eqiad.wmnet with OS bullseye completed: - ms-be1065 (**PASS**) - Downtim...
[08:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:57:47] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade.
[08:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1066.eqiad.wmnet with OS bullseye
[09:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:07] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1066.eqiad.wmnet with OS bullseye
[09:02:30] RECOVERY - Kafka Broker Server on kafka-test1009 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[09:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:07:14] PROBLEM - Kafka Broker Server on kafka-test1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[09:08:07] (PS1) Btullis: Increase the JVM heap for the Hadoop namenode servers [puppet] - https://gerrit.wikimedia.org/r/804551 (https://phabricator.wikimedia.org/T310293)
[09:08:51] (CR) JMeybohm: "I'd say we do as we did in prod and just uninstall the "helm" package from deployment hosts to have alternatives pick up helm 3 as default" [deployment-charts] - https://gerrit.wikimedia.org/r/804486 (https://phabricator.wikimedia.org/T295578) (owner: Tim Starling)
[09:09:36] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35811/console" [puppet] - https://gerrit.wikimedia.org/r/804551 (https://phabricator.wikimedia.org/T310293) (owner: Btullis)
[09:19:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1066.eqiad.wmnet with reason: host reimage
[09:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3002.esams.wmnet
[09:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1066.eqiad.wmnet with reason: host reimage
[09:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3002.esams.wmnet
[09:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:32:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3002.esams.wmnet to ganeti01.svc.esams.wmnet
[09:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:23] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35812/console" [puppet] - https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[09:33:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3002.esams.wmnet to ganeti01.svc.esams.wmnet
[09:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:52] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (matthiasmullie)
[09:35:51] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (matthiasmullie)
[09:36:54] (CR) Cathal Mooney: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[09:38:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1066.eqiad.wmnet with OS bullseye
[09:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:31] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1066.eqiad.wmnet with OS bullseye completed: - ms-be1066 (**PASS**) - Downtim...
[09:39:36] (03PS1) 10Muehlenhoff: Add Matthias Mullie to contributors [puppet] - 10https://gerrit.wikimedia.org/r/804553 [09:40:53] (03CR) 10Muehlenhoff: [C: 03+2] Add Matthias Mullie to contributors [puppet] - 10https://gerrit.wikimedia.org/r/804553 (owner: 10Muehlenhoff) [09:44:33] (03CR) 10JMeybohm: "I've no idea about the difference between "Systemd::Service[]" and "Service[]", but if "Systemd::Service[]" is not the right thing to noti" [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori) [09:47:58] (03CR) 10Muehlenhoff: service::docker: refresh service when config file is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori) [09:50:32] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10ayounsi) Thanks for the quick and thorough answer! Glad to see that there is progress upstream! > exceptional nature of having to add new nodes It's not just this, but also the long term cost of... 
[09:50:41] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10phaultfinder) [09:51:44] (03CR) 10JMeybohm: "> This commit contains the unmodified boilerplate files as generated by" [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [09:56:59] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10MoritzMuehlenhoff) [09:59:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:20] (03PS2) 10Muehlenhoff: Switch idp1001/idp2001 to role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214) [10:04:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:24] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:14:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1067.eqiad.wmnet with OS bullseye [10:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:44] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1067.eqiad.wmnet with OS bullseye [10:19:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:08] (03CR) 10Jbond: Make SREBatchBase operate on host groups (032 
comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [10:27:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) No problem @nskaggs I'm off today but I can put some more verbose instructions together next week and link t... [10:30:05] (03CR) 10JMeybohm: [C: 03+1] service::docker: refresh service when config file is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori) [10:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:32:35] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1067.eqiad.wmnet with reason: host reimage [10:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1067.eqiad.wmnet with reason: host reimage [10:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:44] (03CR) 10Alexandros Kosiaris: "While this will work, it's not backwards compatible. 
Older versions of the cxserver image won't be able to be deployed with this change as" [deployment-charts] - 10https://gerrit.wikimedia.org/r/804256 (owner: 10KartikMistry) [10:46:08] (03CR) 10Joal: [C: 03+1] "Thanks Ben :)" [puppet] - 10https://gerrit.wikimedia.org/r/804551 (https://phabricator.wikimedia.org/T310293) (owner: 10Btullis) [10:49:08] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:53:45] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:54:59] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:56:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. [10:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1067.eqiad.wmnet with OS bullseye [10:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:15] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1067.eqiad.wmnet with OS bullseye completed: - ms-be1067 (**PASS**) - Downtim... 
[11:01:17] (03PS4) 10Jbond: puppetmaster: update private repo pre-commit to error un-staged [puppet] - 10https://gerrit.wikimedia.org/r/803560 [11:01:20] (03CR) 10Jbond: puppetmaster: update private repo pre-commit to error un-staged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803560 (owner: 10Jbond) [11:02:59] (03CR) 10Jbond: [C: 03+2] Add page metadata to Wikibase JSON dumps [puppet] - 10https://gerrit.wikimedia.org/r/802921 (https://phabricator.wikimedia.org/T301104) (owner: 10Mitar) [11:08:49] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/804266 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:09:35] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:13:15] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [11:13:36] (03CR) 10KartikMistry: Update nodejs -> node command (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/804256 (owner: 10KartikMistry) [11:17:29] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:23:00] (03CR) 10Jbond: [C: 03+2] eventschemas: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804470 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [11:23:21] (03CR) 10Jbond: [C: 03+2] etcdmirror: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804471 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [11:24:06] (03CR) 10Jbond: [C: 03+2] envoyproxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804472 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [11:24:41] (03CR) 10Jbond: [C: 03+2] 
galera: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804466 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [11:24:55] (03CR) 10Jbond: [C: 03+2] fifo_log_demux: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804467 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [11:34:15] (03CR) 10Jelto: "that looks mostly good to me." [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [11:35:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:54:37] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:00:32] (03PS9) 10Krinkle: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [12:00:48] (03PS10) 10Krinkle: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [12:00:59] (03CR) 10Krinkle: [C: 03+1] Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [12:01:03] (03PS8) 10Krinkle: Clean up scap sequencing workaround [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/801836 (owner: 10Tim Starling) [12:01:18] (03PS9) 10Krinkle: Clean up scap sequencing workaround for I0cd5dbeab0e6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 (owner: 10Tim Starling) [12:01:22] (03CR) 10Krinkle: [C: 03+1] Clean up scap sequencing workaround for I0cd5dbeab0e6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 (owner: 10Tim Starling) [12:06:31] RECOVERY - Kafka Broker Server on kafka-test1007 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [12:10:02] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Looks good to me. 👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802947 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [12:11:10] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [12:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:17] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001" [12:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:24] (03CR) 10Thiemo Kreuz (WMDE): "This is a work-in-progress, isn't it? I mean, what's the plan with these TODOs?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/802946 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [12:12:13] RECOVERY - Check systemd state on kafka-test1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:15] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802842 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [12:12:33] RECOVERY - Kafka broker TLS certificate validity on kafka-test1007 is OK: SSL OK - Certificate kafka-test1007.eqiad.wmnet valid until 2023-01-24 11:32:00 +0000 (expires in 227 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [12:12:37] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] phpcs: move Misleading$wgDebugLogFile exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802840 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [12:13:11] RECOVERY - Kafka Broker Server on kafka-test1008 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [12:13:39] RECOVERY - Check systemd state on kafka-test1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:21] RECOVERY - Kafka broker TLS certificate validity on kafka-test1008 is OK: SSL OK - Certificate kafka-test1008.eqiad.wmnet valid until 2023-01-24 11:32:00 +0000 (expires in 227 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [12:16:09] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-test1006 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1006 [12:16:32] 
(03CR) 10Thiemo Kreuz (WMDE): phpcs: move AssignmentInControlStructures exclusion inline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796360 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [12:17:25] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-test1010 is CRITICAL: 16 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1010 [12:18:07] RECOVERY - Kafka broker TLS certificate validity on kafka-test1009 is OK: SSL OK - Certificate kafka-test1009.eqiad.wmnet valid until 2023-01-24 11:31:00 +0000 (expires in 227 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [12:18:49] RECOVERY - Kafka Broker Server on kafka-test1009 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [12:19:41] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-test1010 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1010 [12:19:43] RECOVERY - Check systemd state on kafka-test1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:29] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:23:55] RECOVERY - Disk space on kafka-test1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1007&var-datasource=eqiad+prometheus/ops [12:28:23] RECOVERY - MegaRAID 
on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:28:31] RECOVERY - Disk space on kafka-test1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1008&var-datasource=eqiad+prometheus/ops [12:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:30:56] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1042.eqiad.wmnet [12:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:19] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:32:37] (03PS1) 10Hashar: deployment-prep: add keyholder agent for scap [puppet] - 10https://gerrit.wikimedia.org/r/804568 (https://phabricator.wikimedia.org/T310354) [12:36:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:36:23] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 674 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:36:47] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1042.eqiad.wmnet [12:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:43] RECOVERY - Disk space on kafka-test1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kafka-test1009&var-datasource=eqiad+prometheus/ops [12:38:37] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:43:33] (03CR) 10Ottomata: [C: 03+1] Increase the JVM heap for the Hadoop namenode servers [puppet] - 10https://gerrit.wikimedia.org/r/804551 (https://phabricator.wikimedia.org/T310293) (owner: 10Btullis) [12:44:28] (03PS1) 10Jbond: scap: update venv to use the system ca bundle [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 [12:45:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, this looks like a leftover when role::wdqs::autodeploy was removed." 
[puppet] - 10https://gerrit.wikimedia.org/r/803393 (owner: 10Slyngshede) [12:46:10] (03CR) 10Jbond: scap: update venv to use the system ca bundle (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond) [12:46:28] (03CR) 10Jbond: "FYI I have manually applied this change to netbox1002" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond) [12:46:46] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [12:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:16] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync data - jbond@cumin1001" [12:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:30] 10SRE, 10Keyholder: After arming a new key in keyholder, the identity file path does not show up - https://phabricator.wikimedia.org/T257329 (10hashar) A few years later the comment showing up instead of the file has hit me T310354#7994473 The fix above is to set the key comment to use the path using `ssh-key... 
[12:51:47] (03PS1) 10Btullis: Decrease the retention time on the kafka-test cluster to 1 day [puppet] - 10https://gerrit.wikimedia.org/r/804573 (https://phabricator.wikimedia.org/T310342) [12:55:12] (03CR) 10Btullis: [V: 03+1 C: 03+2] Increase the JVM heap for the Hadoop namenode servers [puppet] - 10https://gerrit.wikimedia.org/r/804551 (https://phabricator.wikimedia.org/T310293) (owner: 10Btullis) [12:58:00] (03PS2) 10Samtar: Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802685 (https://phabricator.wikimedia.org/T307725) (owner: 10Samwilson) [13:02:14] (03PS1) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 [13:04:07] (03CR) 10Jforrester: Use a service locator to get a job runner (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793837 (owner: 10D3r1ck01) [13:04:27] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/804568 (https://phabricator.wikimedia.org/T310354) (owner: 10Hashar) [13:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:09:56] kostajh: I am finally reaching your docker image change https://gerrit.wikimedia.org/r/c/integration/config/+/803487 :) [13:11:23] hashar: that probably needs to wait for the php unit entry point patch to be merged again [13:11:47] ah yeah probably [13:12:40] or we make the coverage shell script to detect whether tests/phpunit/phpunit.php is present [13:12:41] hmm [13:13:37] (03CR) 10Ottomata: [C: 03+1] Decrease the retention time on the kafka-test cluster to 1 day [puppet] - 10https://gerrit.wikimedia.org/r/804573 (https://phabricator.wikimedia.org/T310342) (owner: 10Btullis) [13:14:43] RECOVERY - SSH on 
cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:26:28] (03CR) 10Btullis: [C: 03+2] Decrease the retention time on the kafka-test cluster to 1 day [puppet] - 10https://gerrit.wikimedia.org/r/804573 (https://phabricator.wikimedia.org/T310342) (owner: 10Btullis) [13:32:15] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] phabricator: add blackbox http check [puppet] - 10https://gerrit.wikimedia.org/r/804266 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:35:59] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:50:31] (03PS1) 10Krinkle: Profiler: Fix reporting of Redis timeout error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804584 [13:51:45] (JobUnavailable) firing: Reduced availability for job probes/custom in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:55:50] (03CR) 10Filippo Giunchedi: [C: 03+2] netops: add PingUnreachable alert [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [13:56:17] PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:56:45] (JobUnavailable) resolved: Reduced availability for job probes/custom in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:57:15] (JobUnavailable) firing: 
Reduced availability for job probes/custom in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:00:57] RECOVERY - Host sretest1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:02:00] (JobUnavailable) resolved: Reduced availability for job probes/custom in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:39] PROBLEM - SSH on sretest1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:09:41] RECOVERY - SSH on sretest1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:14:00] (03CR) 10Krinkle: [C: 03+2] Profiler: Fix reporting of Redis timeout error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804584 (owner: 10Krinkle) [14:15:07] (03Merged) 10jenkins-bot: Profiler: Fix reporting of Redis timeout error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804584 (owner: 10Krinkle) [14:20:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:21:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:45] 10SRE, 10VPS-project-Codesearch, 10observability: add operations/alerts.git to hound codesearch.wmcloud.org - https://phabricator.wikimedia.org/T310364 (10CDanis) 
[14:25:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:45] 10SRE, 10Librarization, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 4 others: Split GeoIP into a new component - https://phabricator.wikimedia.org/T102848 (10Krinkle) [14:31:18] (03PS1) 10Btullis: Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T310359) [14:33:38] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35817/console" [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T310359) (owner: 10Btullis) [14:35:16] (03PS2) 10Btullis: Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T310359) [14:35:33] !log krinkle@deploy1002 Synchronized src/Profiler.php: (no justification provided) (duration: 03m 43s) [14:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:09] (03CR) 10Btullis: "My PCC run didn't bring about any change on alert1001 but maybe that is because these are virtual resources?" 
[puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T310359) (owner: 10Btullis) [14:40:33] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:43] 10SRE, 10VPS-project-Codesearch, 10observability: add operations/alerts.git to hound codesearch.wmcloud.org - https://phabricator.wikimedia.org/T310364 (10Volans) Duplicate of T306881 ? [14:42:44] (03Abandoned) 10Jforrester: Partial revert "TextHandler::getTextTracksFromRows(): Remove unused code" [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/802952 (https://phabricator.wikimedia.org/T309873) (owner: 10Jforrester) [14:46:25] (03Abandoned) 10Aqu: Reference the pid file used by the scheduler.service [puppet] - 10https://gerrit.wikimedia.org/r/803396 (https://phabricator.wikimedia.org/T310042) (owner: 10Aqu) [14:49:30] (03PS1) 10Andrew Bogott: heat: use the internal keystone port for keystone_authtoken config [puppet] - 10https://gerrit.wikimedia.org/r/804595 [14:50:39] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected v [14:50:39] path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [14:52:07] (03CR) 10Andrew Bogott: [C: 03+2] heat: use the internal keystone port for keystone_authtoken config [puppet] - 10https://gerrit.wikimedia.org/r/804595 (owner: 10Andrew Bogott) [14:54:01] (03CR) 10Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns) [14:54:17] PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:27] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:56:01] (03CR) 10Krinkle: docroot: Improve design of noc.wikimedia.org (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800680 (owner: 10Ladsgroup) [14:56:08] couple of 503s on test.wiki/en.wiki, intermittent [14:56:11] RECOVERY - Host sretest1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:56:19] Hello 503 my old friend [14:56:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Andrew) @papaul, do you have interest in working on this more or should I take back the task? I'm thinking we should probably cu... 
[14:56:59] 10SRE, 10VPS-project-Codesearch, 10observability: add operations/alerts.git to hound codesearch.wmcloud.org - https://phabricator.wikimedia.org/T310364 (10CDanis) [14:57:01] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [14:57:19] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:57:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [14:57:41] * Emperor here [14:57:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:58:01] checking too [14:58:01] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:58:06] hey [14:58:35] Things are fine here if that helps [14:59:08] (03PS3) 10Btullis: Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 
(https://phabricator.wikimedia.org/T309649) [15:00:04] o/ [15:00:40] [discussion in the other place] [15:00:42] we're in _security [15:01:57] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat [15:01:57] ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [15:02:19] (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [15:03:01] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:05:44] (03PS3) 10Ori: service::docker: refresh service when config file is changed [puppet] - 10https://gerrit.wikimedia.org/r/799420 [15:06:26] (03CR) 10Ori: service::docker: refresh service when config file is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori) [15:06:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - 
https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [15:16:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [15:22:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:23:15] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1043.eqiad.wmnet [15:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:05] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:26:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [15:27:43] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 30.22 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:28:35] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1043.eqiad.wmnet [15:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:37] PROBLEM - Varnish traffic drop between 30min ago 
and now at codfw on alert1001 is CRITICAL: 20.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:29:41] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 33.43 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:31:16] 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10TheresNoTime) > Visit any Wikimedia project 2 minutes ago, any page //unable to reproduce currently — time machine broken// ( **/j** ) [15:31:51] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:31:57] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 86.53 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:32:00] TheresNoTime: I'll let you know if I managed to fix mine [15:32:11] (03PS2) 10Krinkle: noc: Redesign noc.wikimedia.org after Wikimedia Design Style Guide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800680 (owner: 10Ladsgroup) [15:32:15] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 98.59 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:34:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - 
https://phabricator.wikimedia.org/T302981 (10Papaul) @andrew agree. I think the same partman recipe can do it by just removing the section below ` # setup the SDB disk with... [15:36:17] (03PS17) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) [15:38:30] (03CR) 10Ahmon Dancy: [C: 03+1] mediawiki: disable revalidation everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [15:38:41] 10SRE, 10Beta-Cluster-Infrastructure, 10Scap, 10serviceops, 10Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (10dancy) 05Open→03Resolved a:03dancy [15:44:39] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestCl [15:44:39] apt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buf [15:44:39] a: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [15:46:40] 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10AlexisJazz) >>! 
In T310368#7995017, @TheresNoTime wrote: >> Visit any Wikimedia project 2 minutes ago, any page > //unable to reproduce currently — time machine broken// ( **/j** ) I tried to repo... [15:47:19] (03PS3) 10Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 [15:47:53] PROBLEM - SSH on sretest1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:48:12] (03CR) 10Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns) [15:48:12] 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi There was indeed a brief moment of unavailability (retroactively-posted incident at https://www.wikimediastatus.net/incidents/5k90l09x2p6k) I'm op... [15:48:20] (03CR) 10CI reject: [V: 04-1] airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns) [15:49:59] RECOVERY - SSH on sretest1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:50:17] (03CR) 10Dduvall: Provide buildkitd to GitLab runners (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [15:51:01] (03PS4) 10Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 [15:58:13] (03PS3) 10Samtar: crhwiki: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431) [15:58:14] 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10AlexisJazz) >>! 
In T310368#7995052, @fgiunchedi wrote: > There was indeed a brief moment of unavailability (retroactively-posted incident at https://www.wikimediastatus.net/incidents/5k90l09x2p6k) >... [15:58:22] (03PS3) 10Samtar: ugwiki: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) [16:00:13] (03PS4) 10Btullis: Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649) [16:07:41] 10SRE, 10Traffic, 10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10fgiunchedi) >>! In T310368#7995061, @AlexisJazz wrote: >>>! In T310368#7995052, @fgiunchedi wrote: >> There was indeed a brief moment of unavailability (retroactively-posted incident at https://www.... [16:25:53] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat [16:25:53] ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [16:28:03] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) is CRITICAL: Test Translate enwiki protected page returned the unexpected status 404 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with [16:28:03] ted value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [16:30:38] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:35:39] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:37:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:52] 10SRE, 10Traffic, 
10Wikimedia-Incident: 503 Service Unavailable - https://phabricator.wikimedia.org/T310368 (10AlexisJazz) >>! In T310368#7995101, @CDanis wrote: >>>! In T310368#7995061, @AlexisJazz wrote: >> That just says "From 14:55 to 15:01 UTC users have been experiencing slow/unavailable access to Wiki... [16:46:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:51] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Jclark-ctr) main board swapped. Tech just left [16:52:25] (03PS5) 10Ottomata: airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns) [16:54:40] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/pcc-worker1002/35820/" [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns) [16:55:23] (03CR) 10Ottomata: [C: 03+2] airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns) [16:58:11] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:03:19] (03PS5) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) [17:03:21] (03PS2) 10BCornwall: Traffic Add alert for Varnish child restart [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) [17:03:46] (03PS6) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) [17:03:48] (03PS3) 10BCornwall: Traffic Add alert for Varnish child restart [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) [17:04:33] (03CR) 
10BCornwall: Traffic Add alert for Varnish child restart (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [17:04:55] (03CR) 10BCornwall: "Thanks for that. Duh!" [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [17:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:05:45] (03PS4) 10BCornwall: Traffic: Add alert for Varnish child restart [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) [17:09:48] jbond here? [17:13:27] (03CR) 10Ottomata: [C: 03+1] Add the analytics contact group to all relevant hosts in icinga [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [17:19:56] (03CR) 10Dzahn: [C: 03+2] "alright, thanks Filippo" [puppet] - 10https://gerrit.wikimedia.org/r/804416 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:23:24] 10SRE, 10Znuny, 10serviceops, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Dzahn) @Arnoldokoth The last change we had uploaded in our meeting the other day is now merged. I would say we can call this resolved and close the ticket (but also creat... 
[17:29:42] (03PS6) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) [17:32:23] (03PS7) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) [17:33:17] (03CR) 10CI reject: [V: 04-1] Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [17:34:52] (03PS3) 10Ssingh: DHCP: make doh and durum hosts use the bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) (owner: 10Dzahn) [17:35:51] (03CR) 10Dduvall: Provide buildkitd to GitLab runners (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [17:36:18] (03CR) 10Ssingh: [C: 03+2] DHCP: make doh and durum hosts use the bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) (owner: 10Dzahn) [17:36:26] :) [17:36:31] :D [17:36:56] (03PS8) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) [17:37:25] (03PS9) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) [17:37:47] (03CR) 10Zabe: vrts: rename cleanup cache service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:41:23] (03PS10) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) [17:41:51] (03CR) 10Dzahn: [C: 03+2] vrts: rename cleanup cache service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) 
(owner: 10AOkoth) [17:42:15] (03CR) 10Dzahn: [C: 03+2] vrts: rename cleanup cache service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:42:30] (03CR) 10Dduvall: Provide buildkitd to GitLab runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [17:46:13] (03CR) 10Dduvall: "Thanks for the review, Jelto. I think I've addressed all of your comments. I do agree that the use of a specific docker network adds a " [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [17:55:09] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:05:49] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:11:32] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Eevans) Is https://phabricator.wikimedia.org/T305568#7992483 relevant here as well? TL;DR were these provisioned as Cassandra hosts with two additional IPs(/DNS) f... 
[18:19:36] (03PS7) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [18:20:57] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:21:44] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Xcollazo - https://phabricator.wikimedia.org/T310385 (10XCollazo-WMF) [18:32:55] PROBLEM - Host cp1089 is DOWN: PING CRITICAL - Packet loss = 100% [18:33:44] oh? [18:34:20] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10RLazarus) [18:35:40] sigh, memory issues [18:35:59] ^ filing a task for cp1089 and downtiming it for now [18:37:02] sounds like a....... dimm situation?? [18:37:04] thanks sukhe [18:37:08] yep [18:37:08] Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1. [18:38:34] 10SRE, 10ops-codfw: Degraded RAID on ms-be2066 - https://phabricator.wikimedia.org/T309595 (10wiki_willy) a:03Papaul [18:39:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1089 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T310387 (10ssingh) [18:40:04] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp1089.eqiad.wmnet with reason: downtimed because of DIMM replacement: T310387 [18:40:07] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp1089.eqiad.wmnet with reason: downtimed because of DIMM replacement: T310387 [18:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:11] T310387: cp1089 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T310387 [18:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1089.eqiad.wmnet,service=ats-be 
[18:42:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1089.eqiad.wmnet,service=varnish-fe [18:42:06] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1089.eqiad.wmnet,service=ats-tls [18:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:03] (03CR) 10JMeybohm: "This change is ready for review." (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [19:02:01] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:25:15] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:29:26] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet [19:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:23] (03PS1) 10Andrew Bogott: Partman: give up on a two-hwraid configure and just configure the first drive. [puppet] - 10https://gerrit.wikimedia.org/r/804633 (https://phabricator.wikimedia.org/T302981) [19:34:31] (03CR) 10Andrew Bogott: [C: 03+2] Partman: give up on a two-hwraid configure and just configure the first drive. 
[puppet] - 10https://gerrit.wikimedia.org/r/804633 (https://phabricator.wikimedia.org/T302981) (owner: 10Andrew Bogott) [19:35:18] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet [19:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:56] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [19:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:10] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:58:05] ^ is expired downtime. I've let b.tulis know [20:00:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:14:26] (03PS1) 10CDanis: only page for NEL after 5 minutes [alerts] - 10https://gerrit.wikimedia.org/r/804640 [20:18:40] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat [20:18:40] ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [20:20:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:23:00] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat [20:23:00] ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [20:23:24] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:25:26] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye [20:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:36] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [20:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:06] RECOVERY - MegaRAID on analytics1068 is OK: 
OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:30:58] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (10bscarone) @CDanis once I am on `bast1003.wikimedia.org` and ssh `stat1005.eqiad.wmnet` or `stat1008.eqiad.wmnet` I am prompted to enter a password, so I... [20:31:19] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (10bscarone) 05Resolved→03Open [20:32:48] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (10CDanis) >>! In T310021#7995684, @bscarone wrote: > @CDanis once I am on `bast1003.wikimedia.org` and ssh `stat1005.eqiad.wmnet` or `stat1008.eqiad.wmnet`... [20:35:09] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10dancy) [20:35:20] (03PS1) 10Andrew Bogott: hwraid-2dev.cfg: etc [puppet] - 10https://gerrit.wikimedia.org/r/804644 [20:36:34] (03CR) 10Andrew Bogott: [C: 03+2] hwraid-2dev.cfg: etc [puppet] - 10https://gerrit.wikimedia.org/r/804644 (owner: 10Andrew Bogott) [20:36:57] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye [20:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:24] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [20:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:48] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to 
analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (10bscarone) Oh, I see, that was the issue. Now I managed to do it, thank you, I will close the task. [20:40:51] (03PS11) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) [20:41:06] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (10bscarone) 05Open→03Resolved [20:52:17] (03PS1) 10Andrew Bogott: partman/hwraid-2dev.cfg: yet more etc [puppet] - 10https://gerrit.wikimedia.org/r/804645 [20:53:52] (03CR) 10Andrew Bogott: [C: 03+2] partman/hwraid-2dev.cfg: yet more etc [puppet] - 10https://gerrit.wikimedia.org/r/804645 (owner: 10Andrew Bogott) [20:54:28] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye [20:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:49] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [20:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:40] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:20:46] (03CR) 10Krinkle: [C: 03+1] Switch wgMainStash to db-mainstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129) (owner: 10Tim Starling) [21:23:30] (03PS3) 10Krinkle: mediawiki: disable revalidation for 
api,app,parsoid clusters [puppet] - https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[21:25:03] (PS4) Krinkle: mediawiki: disable revalidation for api,app,parsoid clusters [puppet] - https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[21:25:10] (CR) Krinkle: [C: +1] mediawiki: disable revalidation for api,app,parsoid clusters [puppet] - https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[21:27:46] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:32:20] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35822/" [puppet] - https://gerrit.wikimedia.org/r/778243 (owner: Dzahn)
[21:37:54] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:39:50] ACKNOWLEDGEMENT - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough daniel_zahn RhinosF1 already pinged btullis https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:41:26] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[21:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:07] Thanks mutante
[21:42:36]
That was deliberately broken. The raid battery has been swapped with another server as I think it's going eventually.
[21:43:04] SRE: cloudelastic1001 through cloudelastic1006: CRITICAL - commonswiki_file - https://phabricator.wikimedia.org/T310400 (Dzahn)
[21:44:35] SRE: cloudelastic1001 through cloudelastic1006: CRITICAL - commonswiki_file - https://phabricator.wikimedia.org/T310400 (Dzahn)
[21:44:48] mutante: that's a dupe task
[21:45:10] SRE: cloudelastic1001 through cloudelastic1006: CRITICAL - unassigned shard / commonswiki_file - https://phabricator.wikimedia.org/T310400 (RhinosF1)
[21:45:16] (CR) Krinkle: [C: -1] multiversion: Simplify code and improve documentation (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/785308 (owner: Krinkle)
[21:45:22] SRE: cloudelastic1001 through cloudelastic1006: CRITICAL - unassigned shard / commonswiki_file - https://phabricator.wikimedia.org/T310400 (Dzahn) duplicate→Open
[21:46:01] SRE: cloudelastic1001 through cloudelastic1006: CRITICAL - unassigned shard / commonswiki_file - https://phabricator.wikimedia.org/T310400 (Dzahn)
[21:46:40] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1001 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:40] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1002 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:40] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1003 is CRITICAL: CRITICAL -
commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:40] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1004 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:40] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1005 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:41] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1006 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) daniel_zahn https://phabricator.wikimedia.org/T309648 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:49:49] !log acking unhandled crit alerts on cloud dev hosts
[21:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:29] !log miscweb1002 - logrotate service was broken for unknown reasons, no recent change
[21:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:01] (CR) CDanis: [C: +1] admin: add ml-team-admins to ores-admin by default [puppet] - https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044) (owner: Elukey)
[21:58:37] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (CDanis) LGTM for clinic duty
[21:59:01] SRE, LDAP-Access-Requests: Grant Access to wmf for Xcollazo - https://phabricator.wikimedia.org/T310385 (CDanis) Open→Resolved a:CDanis Done!
[21:59:49] SRE, LDAP-Access-Requests, Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (CDanis) a:KFrancis @KFrancis can you confirm an NDA on file for this researcher? Thanks!
[22:00:16] !log miscweb1002 - systemctl start logrotate (it worked on second attempt, uh?, but it worked now) - systemctl reset-failed to clear icinga alerts
[22:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:34] RECOVERY - Check systemd state on miscweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:03:03] !log mirror1001 - nginx service failed since > 1 month and unhandled alert - site is up though
[22:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:00] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:22] !log mirror1001 - monitored nginx - package was in state "rc" and apache is running instead. systemctl reset-failed cleared alerts
[22:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:35] SRE, ops-eqiad, DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (wiki_willy) Updated estimate, excluding the EX4200s and EX4300, is attached: {F35226351} The recycling pickup will be Wednesday (June 15) between 8a-12p ET and the onsite drive shredding will be later th...
[22:39:14] (PS1) Krinkle: noc: Add a menu in the new design, add some additional links [mediawiki-config] - https://gerrit.wikimedia.org/r/804670
[22:43:06] (PS3) Krinkle: multiversion: Simplify code and improve documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/785308
[23:08:18] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:27:24] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook