[00:00:08] (CR) Krinkle: [C: +1] "I reproduced this as follows:" [mediawiki-config] - https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling) [00:00:21] (PS4) Tim Starling: Remove PHP 7.4 version check and prepare for title case [mediawiki-config] - https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552) [00:00:23] (PS4) Tim Starling: Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552) [00:03:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [00:09:34] (PS3) CDanis: Re-introduce newconnrate [puppet] - https://gerrit.wikimedia.org/r/842539 (https://phabricator.wikimedia.org/T306580) [00:09:56] (CR) CDanis: "PCC https://puppet-compiler.wmflabs.org/pcc-worker1001/37549/" [puppet] - https://gerrit.wikimedia.org/r/842539 (https://phabricator.wikimedia.org/T306580) (owner: CDanis) [00:10:45] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:28:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [00:29:55] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts 
elastic[2025-2027] [00:32:53] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:33:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [00:33:39] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:33:57] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:34:56] (PS1) Ryan Kemper: elastic: decom elastic20[25-36] [puppet] - https://gerrit.wikimedia.org/r/842547 [00:35:44] (PS2) Ryan Kemper: elastic: decom elastic20[25-36] [puppet] - https://gerrit.wikimedia.org/r/842547 [00:36:36] !log T300943 Decom'ing elastic20[25-36]. 
Decommissioning in batches by row, starting with row A (2025-27) [00:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:41] T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943 [00:38:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [00:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:43:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [00:43:09] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:23] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=elastic2025* [00:44:05] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [00:45:57] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=elastic2026.codfw.wmnet [00:48:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:48:08] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for 
hosts elastic[2025-2027] [00:49:55] !log [Elastic] `ryankemper@elastic1083:~$ sudo systemctl restart elasticsearch_7*` to clear `CirrusSearchJVMGCYoungPoolInsufficient` [00:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:40] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2028-2030] [00:53:02] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [01:06:43] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:13:14] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [01:14:33] (CR) Tim Starling: [C: +2] Remove PHP 7.4 version check and prepare for title case [mediawiki-config] - https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling) [01:15:26] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:15:27] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic[2028-2030] [01:15:48] (Merged) jenkins-bot: Remove PHP 7.4 version check and prepare for title case [mediawiki-config] - https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling) [01:19:42] ops-eqiad: eqaid: duplicate serial: - https://phabricator.wikimedia.org/T320772 (Papaul) [01:20:48] !log tstarling@deploy1002 Synchronized wmf-config/UcfirstOverrides.php: for T292552, should have no effect at this stage (duration: 03m 46s) [01:20:53] T292552: Rename articles and users to prepare for PHP 7.3 unicode 
changes - https://phabricator.wikimedia.org/T292552 [01:26:43] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 03m 36s) [01:32:14] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2031-2033].codfw.wmnet [01:35:04] (PS3) Ryan Kemper: elastic: decom elastic20[25-36] [puppet] - https://gerrit.wikimedia.org/r/842547 (https://phabricator.wikimedia.org/T300943) [01:37:06] (PS4) Ryan Kemper: elastic: decom elastic20[25-36] [puppet] - https://gerrit.wikimedia.org/r/842547 (https://phabricator.wikimedia.org/T300943) [01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:40:20] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [01:42:16] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:42:17] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic[2031-2033].codfw.wmnet [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:57] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:52:45] (JobUnavailable) firing: 
(10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:58:41] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:59:18] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2034,2036].codfw.wmnet [02:01:12] !log T300943 Final batch of decom'ing `elastic20[25-36]` => already decommissioned rows A/B/C; starting final row D (corresponding to `203[4,6]`) [02:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:01:17] T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943 [02:05:17] (CR) Ryan Kemper: [C: +2] elastic: decom elastic20[25-36] [puppet] - https://gerrit.wikimedia.org/r/842547 (https://phabricator.wikimedia.org/T300943) (owner: Ryan Kemper) [02:05:50] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:58] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:10:59] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic[2034,2036].codfw.wmnet [02:11:26] !log T300943 Decom of elastic20[25-36] complete. Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/842547. 
This is done [02:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:31] T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943 [02:12:00] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [02:13:20] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:24:41] !log tstarling@deploy1002 Synchronized wmf-config: clean up deleted file (duration: 03m 46s) [02:54:39] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:06:09] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 110 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:07:59] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:08:23] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 11 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:17:09] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:22:01] PROBLEM - MariaDB Replica Lag: s4 #page on db1143 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1306.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:30:51] PROBLEM - DNS on elastic2026.mgmt is CRITICAL: Domain elastic2026.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:30:51] PROBLEM - DNS on elastic2027.mgmt is CRITICAL: Domain elastic2027.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:30:51] PROBLEM - DNS on elastic2028.mgmt is CRITICAL: Domain elastic2028.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:30:51] PROBLEM - DNS on elastic2030.mgmt is CRITICAL: Domain elastic2030.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:30:51] PROBLEM - DNS on elastic2029.mgmt is CRITICAL: Domain elastic2029.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:35:21] PROBLEM - DNS on elastic2025.mgmt is CRITICAL: Domain elastic2025.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:42:23] !log oblivian@cumin1001 dbctl commit (dc=all): 'depool db1143, lagging', diff saved to https://phabricator.wikimedia.org/P35485 and previous config saved to /var/cache/conftool/dbconfig/20221014-034223-oblivian.json [03:50:53] PROBLEM - MegaRAID on 
analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:34:37] PROBLEM - Host elastic2025 is DOWN: PING CRITICAL - Packet loss = 100% [04:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:38:23] PROBLEM - Host elastic2026 is DOWN: PING CRITICAL - Packet loss = 100% [04:40:01] PROBLEM - Host elastic2027 is DOWN: PING CRITICAL - Packet loss = 100% [04:46:55] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:53:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [05:01:15] PROBLEM - Host elastic2028 is DOWN: PING CRITICAL - Packet loss = 100% [05:06:51] PROBLEM - Host elastic2029 is DOWN: PING CRITICAL - Packet loss = 100% [05:09:25] PROBLEM - Host elastic2030 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:33] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:25:39] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:27:33] <_joe_> uh wat [06:27:52] <_joe_> we actually lost half a row of ES in codfw? [06:27:57] <_joe_> why isn't this alerting [06:29:23] <_joe_> ah these are machines to decom apparently [06:29:47] <_joe_> they're not in manifest/site.pp anymore, making it even more confusing [06:37:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Not working well [06:37:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Not working well [06:43:10] (CR) Ayounsi: [C: +1] "Assuming the linked changes get approved. The logic looks good to me, but I can't mentally interpret it and see what it would look like. M" [puppet] - https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: Jbond) [06:45:27] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 7843 [06:46:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 7843 [06:52:09] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:58:51] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221014T0700) [07:16:47] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1008.eqiad.wmnet with reason: Remove from cluster for eventual decom [07:17:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1008.eqiad.wmnet with reason: Remove from cluster for eventual decom [07:18:34] (PS2) Muehlenhoff: Remove ganeti role from ganeti1008 [puppet] - https://gerrit.wikimedia.org/r/842510 (https://phabricator.wikimedia.org/T320419) [07:21:23] (CR) Muehlenhoff: [C: +2] Remove ganeti role from ganeti1008 [puppet] - https://gerrit.wikimedia.org/r/842510 (https://phabricator.wikimedia.org/T320419) (owner: Muehlenhoff) [07:24:05] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:25:21] _joe_: those instances have been decom'd, it looks like 5 of them are still showing up in icinga though [07:25:38] <_joe_> ryankemper: yeah hence my confusion [07:29:11] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff) [07:30:40] they are still in puppetdb and debmonitor, so something must have been off with the run of the decom cookbook [07:33:58] ryankemper: it seems when running the decom cookbook partially a botched query was submitted, I'm seeing "Query 'elastic20[28-30]' did not match any host or failed" in the logs [07:34:23] so simply re-running the cookbook should fix it [07:36:07] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1005.eqiad.wmnet [07:36:42] moritzm: thanks, and yeah I 
can see I ran it like `elastic20[28-30]` instead of `elastic20[28-30]*` [07:37:31] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2025-2027].codfw.wmnet [07:41:36] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:43:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:43:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1005.eqiad.wmnet [07:44:36] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:45:00] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1006.eqiad.wmnet [07:51:13] (PS1) David Caro: ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693 [07:54:27] (CR) CI reject: [V: -1] ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693 (owner: David Caro) [07:54:55] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:55:32] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:56:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:56:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1006.eqiad.wmnet [07:57:48] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1007.eqiad.wmnet [08:02:29] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:04:45] (PS2) David Caro: ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693 [08:05:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:05:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1007.eqiad.wmnet [08:07:21] !log jmm@cumin2002 START - Cookbook 
sre.hosts.decommission for hosts ganeti1008.eqiad.wmnet [08:12:20] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:14:10] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [08:15:20] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:15:21] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts elastic[2025-2027].codfw.wmnet [08:15:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:15:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1008.eqiad.wmnet [08:21:46] (PS1) Muehlenhoff: Remove remaining Puppet references for ganeti1005-1008 [puppet] - https://gerrit.wikimedia.org/r/842694 (https://phabricator.wikimedia.org/T320419) [08:26:28] (CR) Muehlenhoff: [C: +2] Remove remaining Puppet references for ganeti1005-1008 [puppet] - https://gerrit.wikimedia.org/r/842694 (https://phabricator.wikimedia.org/T320419) (owner: Muehlenhoff) [08:29:02] !log installing git security updates on buster [08:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:28] SRE, Data Engineering Planning, Data Pipelines, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (EChetty) [08:31:25] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2028-2030].codfw.wmnet [08:32:51] (CR) FNegri: [C: +1] ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693 (owner: David Caro) [08:35:25] ops-eqiad, decommission-hardware: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419 (MoritzMuehlenhoff) a:Jclark-ctr These are ready for DC ops unracking tasks. 
[08:37:28] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:44:43] (CR) David Caro: [C: +2] ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693 (owner: David Caro) [08:46:05] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [08:46:35] (CR) David Caro: [C: -1] alerts.downtime_host: attempt to match alert hostnames with : (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott) [08:47:16] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:47:17] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts elastic[2028-2030].codfw.wmnet [08:47:25] (CR) David Caro: [C: +2] Revert "cloudbackups: run nfs backups from labstore1004 rather than 1005" [puppet] - https://gerrit.wikimedia.org/r/838090 (owner: David Caro) [08:48:08] (Merged) jenkins-bot: ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693 (owner: David Caro) [08:53:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [09:14:25] (PS9) Giuseppe Lavagetto: New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (https://phabricator.wikimedia.org/T320782) [09:14:51] (CR) CI reject: [V: -1] New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (https://phabricator.wikimedia.org/T320782) (owner: Giuseppe Lavagetto) [09:22:06] (PS1) Elukey: ml-services: update revscoring Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/842697 (https://phabricator.wikimedia.org/T320374) [09:23:06] SRE, Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (MatthewVernon) [[ https://en.wikipedia.org/wiki/Louhi | Louhi ]], the shape-changing witch-queen from the Kalevala? I don't think currently in use as software-name... [09:27:05] (CR) CI reject: [V: -1] ml-services: update revscoring Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/842697 (https://phabricator.wikimedia.org/T320374) (owner: Elukey) [09:27:18] (CR) AikoChou: [C: +1] ml-services: update revscoring Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/842697 (https://phabricator.wikimedia.org/T320374) (owner: Elukey) [09:35:07] (CR) Jbond: wmflib::ansi: add new ansi formatting function (1 comment) [puppet] - https://gerrit.wikimedia.org/r/842496 (owner: Jbond) [09:35:20] (PS4) Jbond: wmflib::ansi: add new ansi formatting function [puppet] - https://gerrit.wikimedia.org/r/842496 [09:36:17] (PS6) Jbond: P:netbox::host: create a motd for the status [puppet] - https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [09:42:36] PROBLEM - Dell PowerEdge RAID Controller on db1202 is CRITICAL: communication 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [09:42:37] ACKNOWLEDGEMENT - Dell PowerEdge 
RAID Controller on db1202 is CRITICAL: communication 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T320786 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [09:42:42] SRE, ops-eqiad: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (ops-monitoring-bot) [09:45:52] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:53:03] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/842496 (owner: Jbond) [09:53:22] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:44] (CR) Jbond: P:netbox::host: create a motd for the status (3 comments) [puppet] - https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: Jbond) [09:58:37] (PS7) Jbond: P:netbox::host: create a motd for the status [puppet] - https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) [10:02:18] SRE, Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (cmooney) Charon is also the StrongSwan IKEv2 daemon: https://docs.strongswan.org/docs/5.9/daemons/charon.html >>! In T319409#8316323, @MatthewVernon wrote: > [[ https://en.wikipedia.org/wiki/Louhi | Lou... [10:11:29] SRE, observability, serviceops, Maps (Kartotherian): Get Kartotherian SLO metrics into Prometheus - https://phabricator.wikimedia.org/T320748 (fgiunchedi) With my Observability/Prometheus hat on: to bridge the statsd/prometheus gap we've been deploying `profile::prometheus::statsd_exporter` e.g.... 
[10:13:08] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:24] !log Deployed patch for T320785 [10:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:52] (CR) Elukey: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/842697 (https://phabricator.wikimedia.org/T320374) (owner: Elukey) [10:19:38] (PS1) Filippo Giunchedi: aptrepo: add trailing newline to "updates" [puppet] - https://gerrit.wikimedia.org/r/842703 [10:20:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:35] if anyone is up for a trivial review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/842703/ [10:22:01] !log upgrade grafana to 8.5.14 [10:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:18] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:31:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:21] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/842703 (owner: Filippo Giunchedi) [10:44:52] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:58:46] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:18:29] (CR) Muehlenhoff: "This is getting quite ready! 
I did another pass, but most of them are smaller nits/comments." [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [11:20:20] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Clement_Goubert) Just for confirmation before diving into it on Monday, the list of services to re-de... [11:20:51] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond) [11:31:26] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:43:34] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) >>! In T319067#8314210, @BBlack wrote: > Is it possible to fake this out with a bunch of trivially-built empty udebs that are in our r... [11:45:12] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: add trailing newline to "updates" [puppet] - 10https://gerrit.wikimedia.org/r/842703 (owner: 10Filippo Giunchedi) [11:46:04] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) >>! In T319067#8314213, @ssingh wrote: > On the Traffic side, the image + cookbook patch is working for us. The only issue being -- an... 
[11:53:56] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Peachey88) [11:56:34] (03PS1) 10Ladsgroup: db1143: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/842752 (https://phabricator.wikimedia.org/T320773) [11:58:16] (03PS1) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [11:58:50] (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [12:01:34] (03CR) 10Muehlenhoff: WIP: role::idm Basic deployment of IDM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [12:01:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1202 - Degraded RAID (T320786)', diff saved to https://phabricator.wikimedia.org/P35487 and previous config saved to /var/cache/conftool/dbconfig/20221014-120155-ladsgroup.json [12:02:01] T320786: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 [12:02:25] (03PS2) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:02:59] (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [12:04:59] (03PS2) 10Ladsgroup: db1143: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/842752 (https://phabricator.wikimedia.org/T320773) [12:05:06] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1143: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/842752 (https://phabricator.wikimedia.org/T320773) (owner: 10Ladsgroup) [12:06:10] (03PS3) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 
10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:06:44] (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [12:06:46] (03PS1) 10Ladsgroup: db1143: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/842754 (https://phabricator.wikimedia.org/T320786) [12:07:14] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1143: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/842754 (https://phabricator.wikimedia.org/T320786) (owner: 10Ladsgroup) [12:07:53] 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10MoritzMuehlenhoff) [12:08:27] (03PS4) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:08:55] 10SRE, 10Infrastructure-Foundations: Implement a staging setup - https://phabricator.wikimedia.org/T320795 (10MoritzMuehlenhoff) [12:09:02] (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [12:12:31] (03PS5) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:13:05] (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [12:13:19] (03CR) 10Slyngshede: WIP: role::idm Basic deployment of IDM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [12:14:23] (03PS6) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 
(https://phabricator.wikimedia.org/T320428) [12:14:58] (CR) CI reject: [V: -1] WIP: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [12:19:26] (PS7) Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:19:36] SRE, Infrastructure-Foundations: Initial production deployment - https://phabricator.wikimedia.org/T320797 (MoritzMuehlenhoff) [12:21:31] (CR) CI reject: [V: -1] WIP: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [12:23:34] (PS8) Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:24:08] (CR) CI reject: [V: -1] WIP: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [12:24:32] SRE, Infrastructure-Foundations: IDM integration into CAS SSO - https://phabricator.wikimedia.org/T320799 (MoritzMuehlenhoff) [12:25:47] (PS9) Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:26:22] (CR) CI reject: [V: -1] WIP: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede) [12:27:15] (PS10) Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [12:27:56] SRE, Infrastructure-Foundations: IDM milestone 3 "Build-out for self service" - https://phabricator.wikimedia.org/T320801 (MoritzMuehlenhoff) [12:28:25] SRE, Infrastructure-Foundations: 
Create a mockup and involve designers - https://phabricator.wikimedia.org/T320802 (10MoritzMuehlenhoff) [12:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:39:22] 10SRE, 10Infrastructure-Foundations: Define the core attribute list managed in the IDM with all stakeholders - https://phabricator.wikimedia.org/T320805 (10MoritzMuehlenhoff) [12:41:23] 10SRE, 10Infrastructure-Foundations: Consider reusing some wiki data sources for signup/restrictions - https://phabricator.wikimedia.org/T320806 (10MoritzMuehlenhoff) [12:42:36] 10SRE, 10Infrastructure-Foundations: Implement OAuth account validation for linking an account to a wiki account - https://phabricator.wikimedia.org/T320807 (10MoritzMuehlenhoff) [12:43:48] 10SRE, 10Infrastructure-Foundations: Implement email address validation workflow - https://phabricator.wikimedia.org/T320808 (10MoritzMuehlenhoff) [12:44:20] (03PS1) 10Muehlenhoff: pontoon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842756 (https://phabricator.wikimedia.org/T308013) [12:44:22] (03PS1) 10Muehlenhoff: paws: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842757 (https://phabricator.wikimedia.org/T308013) [12:44:24] (03PS1) 10Muehlenhoff: wmcs::nfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842758 (https://phabricator.wikimedia.org/T308013) [12:44:26] (03PS1) 10Muehlenhoff: kafka: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842759 (https://phabricator.wikimedia.org/T308013) [12:44:28] (03PS1) 10Muehlenhoff: dumps::generation: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842760 (https://phabricator.wikimedia.org/T308013) [12:44:30] (03PS1) 10Muehlenhoff: kerberos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842761 
(https://phabricator.wikimedia.org/T308013) [12:44:32] (03PS1) 10Muehlenhoff: idp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842762 (https://phabricator.wikimedia.org/T308013) [12:44:34] (03PS1) 10Muehlenhoff: statistics : Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842763 (https://phabricator.wikimedia.org/T308013) [12:44:36] (03PS1) 10Muehlenhoff: labs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842764 (https://phabricator.wikimedia.org/T308013) [12:44:38] (03PS1) 10Muehlenhoff: kubernetes: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842765 (https://phabricator.wikimedia.org/T308013) [12:45:08] 10SRE, 10Infrastructure-Foundations: Figure out a captcha option - https://phabricator.wikimedia.org/T320809 (10MoritzMuehlenhoff) [12:47:37] (03PS1) 10Filippo Giunchedi: debian: add packaging [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 [12:53:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [13:01:37] 10SRE, 10ARM support: SRE Summit 2022 Outcome of Session "Adoption of aarch64 (aka arm64) in WMF production?" - https://phabricator.wikimedia.org/T320811 (10akosiaris) [13:02:09] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Ottomata) Correct! 
[13:02:42] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T319425 (10Papaul) 05Open→03Resolved a:03Papaul The interface is not configure and it is disable [13:04:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:05:02] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Papaul) a:05LSobanski→03MatthewVernon [13:05:19] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:05:25] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:05:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [13:05:50] around [13:06:13] <_joe_> uh oh [13:06:53] head over to _security [13:10:19] (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:10:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [13:13:23] 10SRE, 10ARM support: SRE Summit 2022 Outcome of Session "Adoption of aarch64 (aka arm64) in WMF production?" - https://phabricator.wikimedia.org/T320811 (10akosiaris) [13:14:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:18:29] (03CR) 10Jbond: [C: 03+1] wmcs::nfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842758 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:18:46] (03PS11) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [13:19:54] (03CR) 10Jbond: [C: 03+1] kerberos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842761 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:19:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:23:11] (03PS12) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [13:28:13] (03CR) 10Vgutierrez: [C: 03+1] "looks good, have you considered using http_fail_rate to detect that the upper layers are struggling?" 
[puppet] - 10https://gerrit.wikimedia.org/r/842539 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [13:34:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:37:07] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:33] (03PS1) 10KartikMistry: Update cxserver to 2022-10-14-080318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/842812 (https://phabricator.wikimedia.org/T319175) [13:43:51] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:09] (03CR) 10David Caro: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/842758 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:45:15] RECOVERY - Host mw1314.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [13:46:05] (03CR) 10Elukey: "This is awesome, thanks so much for doing it! 
I left a comment for the patch file, just to get the purpose of the preamble, the rest looks" [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 (owner: 10Filippo Giunchedi) [13:48:20] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419 (10Jclark-ctr) [13:49:11] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T319682 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [13:49:56] (03CR) 10Elukey: [C: 03+2] ml-services: update revscoring Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/842697 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey) [13:50:21] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:51:46] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudnet1004.eqiad.wmnet - https://phabricator.wikimedia.org/T319683 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [13:53:51] (03CR) 10Muehlenhoff: [C: 03+1] "Two nits, looks good otherwise." 
[debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 (owner: 10Filippo Giunchedi) [13:55:39] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:57:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:57:20] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:58:01] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419 (10Jclark-ctr) 05Open→03Resolved completed Decom process [13:58:30] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T319682 (10Jclark-ctr) 05Open→03Resolved completed Decom [13:58:51] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudnet1004.eqiad.wmnet - https://phabricator.wikimedia.org/T319683 (10Jclark-ctr) 05Open→03Resolved completed Decom process [13:59:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] labs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842764 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:59:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:00:00] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [14:00:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Jclark-ctr) [14:00:25] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Jclark-ctr) 05Open→03Resolved Finished Decom process [14:06:36] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 
'revscoring-articlequality' for release 'main' . [14:09:04] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:09:34] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:11:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:13:55] (03PS2) 10Muehlenhoff: labs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842764 (https://phabricator.wikimedia.org/T308013) [14:13:57] this is me, we are working on this log spam from k8s :( [14:14:10] (the kafka logging too many msg etc..) [14:16:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:17:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:18:19] (03CR) 10Muehlenhoff: [C: 03+2] labs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842764 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:18:59] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[28-39].eqiad.wmnet - https://phabricator.wikimedia.org/T318691 (10Jclark-ctr) [14:19:10] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[14:21:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:21:43] (03CR) 10Herron: [C: 03+1] "Thanks, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/842759 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:22:28] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:22:39] (03PS1) 10Clément Goubert: Remove references to deprecated kubeyaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) [14:24:27] (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842756 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:24:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:25:52] (03CR) 10Muehlenhoff: [C: 03+2] pontoon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842756 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:27:01] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:27:20] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:27:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:27:40] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:28:38] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[14:29:05] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:29:11] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:29:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:29:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:29:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:30:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:31:16] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:31:47] PROBLEM - Host mw1314.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:32:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[14:32:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:35:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:37:50] (03CR) 10Jbond: [C: 03+1] idp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842762 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:37:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:40:40] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:40:49] (03CR) 10Klausman: [C: 03+1] admin_ng: set higher circuit breaking limits for EventGate on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/842494 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey) [14:42:16] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[14:42:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:42:59] (03PS2) 10Filippo Giunchedi: debian: add packaging [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 [14:43:06] (03CR) 10Filippo Giunchedi: "Thank you for the quick reviews!" [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 (owner: 10Filippo Giunchedi) [14:43:14] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:47:27] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:47:59] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:48:23] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[14:49:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:54:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:55:44] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it!" [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 (owner: 10Filippo Giunchedi) [14:55:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:12:52] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T320817 (10phaultfinder) [15:13:14] (03PS1) 10Elukey: knative-serving: reduce the default logging levels [deployment-charts] - 10https://gerrit.wikimedia.org/r/842829 (https://phabricator.wikimedia.org/T320468) [15:14:05] (03CR) 10CI reject: [V: 04-1] knative-serving: reduce the default logging levels [deployment-charts] - 10https://gerrit.wikimedia.org/r/842829 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey) [15:17:59] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10dcaro) > But still fairly comfortably within the 10G NIC capcity. What throughput limits were hit? Sorry if I missed them on the dashboard you linked, I d... 
[15:23:23] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/842394 [15:31:54] (PS2) Elukey: knative-serving: reduce the default logging levels [deployment-charts] - https://gerrit.wikimedia.org/r/842829 (https://phabricator.wikimedia.org/T320468) [15:32:27] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/842395 [15:35:50] (CR) Klausman: [C: +1] knative-serving: reduce the default logging levels [deployment-charts] - https://gerrit.wikimedia.org/r/842829 (https://phabricator.wikimedia.org/T320468) (owner: Elukey) [15:37:07] (CR) Elukey: [C: +2] knative-serving: reduce the default logging levels [deployment-charts] - https://gerrit.wikimedia.org/r/842829 (https://phabricator.wikimedia.org/T320468) (owner: Elukey) [15:40:17] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:40:18] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:43:29] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:44:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:45:55] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:46:23] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:48:31] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:48:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:49:10] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[15:52:22] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) > I did some tests in the past and that was more or less the maximum network throughput I got, so I was expecting for that to be the same (thinki... [15:53:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:14:53] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [16:16:03] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:09] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:18:01] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:21:41] (03PS1) 10Cwhite: logstash: set webrequest index replicas to 0 for large indexes [puppet] - 10https://gerrit.wikimedia.org/r/842396 (https://phabricator.wikimedia.org/T313099) [16:27:19] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:27:56] (03CR) 10Cwhite: [C: 03+1] "I'd like to deploy 
this before the next curator run at Oct 15 00:42 UTC." [puppet] - 10https://gerrit.wikimedia.org/r/842396 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [16:30:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:30:42] (03PS2) 10Cwhite: logstash: set webrequest index replicas to 0 for large indexes [puppet] - 10https://gerrit.wikimedia.org/r/842396 (https://phabricator.wikimedia.org/T313099) [16:35:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:37:23] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [16:38:01] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:38:35] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:55:23] (03PS1) 10Jbond: puppetdb: create small script to quer puppetdb and give a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 [16:55:56] (03CR) 10CI reject: [V: 04-1] puppetdb: create small script to quer puppetdb and give a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond) [16:57:18] (03CR) 10JHathaway: wmflib::ansi: add new ansi formatting function (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/842496 (owner: 10Jbond) [16:59:37] (03PS2) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 [17:00:10] (03CR) 10CI reject: [V: 04-1] puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond) [17:01:36] (03PS3) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 [17:02:10] (03CR) 10CI reject: [V: 04-1] puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond) [17:03:00] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [17:06:17] (03CR) 10Herron: [C: 03+1] logstash: set webrequest index replicas to 0 for large indexes [puppet] - 10https://gerrit.wikimedia.org/r/842396 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [17:07:17] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:08:25] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: 
CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:11:49] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:15:07] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:16:06] (03PS4) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 [17:16:08] (03PS1) 10Jbond: P:puppetdb: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/842854 [17:18:10] (03CR) 10CI reject: [V: 04-1] puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond) [17:18:45] (03CR) 10CI reject: [V: 04-1] P:puppetdb: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/842854 (owner: 10Jbond) [17:19:57] (03PS5) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 [17:23:32] (03PS6) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 [17:23:53] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:32:28] (03PS1) 10Brennen Bearnes: gitlab runner: allow golang:* images [puppet] - 10https://gerrit.wikimedia.org/r/842857 (https://phabricator.wikimedia.org/T320825) [17:33:41] (03PS2) 10Brennen Bearnes: gitlab runner: allow golang:* images [puppet] - 10https://gerrit.wikimedia.org/r/842857 
(https://phabricator.wikimedia.org/T320825) [17:37:00] (03CR) 10Addshore: [C: 03+1] gitlab runner: allow golang:* images [puppet] - 10https://gerrit.wikimedia.org/r/842857 (https://phabricator.wikimedia.org/T320825) (owner: 10Brennen Bearnes) [17:43:58] (03PS7) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 [17:46:15] (03CR) 10Cwhite: [C: 03+2] logstash: set webrequest index replicas to 0 for large indexes [puppet] - 10https://gerrit.wikimedia.org/r/842396 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [17:47:54] I'd like to deploy a config patch for beta in a bit, in an hour or so. [17:48:48] I know it's Friday and all that... the change would allow me to test a change to VE that will be riding the train next week. Would be good if I could check it out on beta before the deployment branch. [17:48:52] Any objections? [17:51:36] (03PS8) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 [17:55:35] (03PS1) 10Daniel Kinzler: Beta: set $wmgVisualEditorAccessRestbaseDirectly = false for dewiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842858 (https://phabricator.wikimedia.org/T320703) [17:55:44] This one --^ [17:56:14] beta-only so no objection from me. [17:58:04] Great! I'll have dinner and then do it when my blood sugar is back to normal :) [18:01:36] (03CR) 10D3r1ck01: [C: 03+1] Beta: set $wmgVisualEditorAccessRestbaseDirectly = false for dewiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842858 (https://phabricator.wikimedia.org/T320703) (owner: 10Daniel Kinzler) [18:03:58] mutante: o/ contint agents are offline and ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/834400 [18:04:10] i've logged in -releng [18:05:21] duesen: fine by me too :) [18:05:37] compiling the change. How about we disable puppet on contint*, then deploy to non-active one.. 
then to active one [18:05:39] though you might want to get an a-ok from an sre as well [18:06:16] mutante: that sounds good [18:06:38] confirms that 2001 is master [18:06:40] (03PS9) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 [18:07:43] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37554/contint2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [18:08:08] !log contint* - temp disabled puppet, deploying gerrit:834400, docker version upgrade on CI servers (T318382) [18:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:13] T318382: Upgrade docker on integration hosts for fixes to BuildKit builder - https://phabricator.wikimedia.org/T318382 [18:08:58] merged, running puppet on contint1001.. disabled on 2001 [18:09:17] and.. it fails [18:09:22] E: Version '5:20.10.18~3-0~debian-buster' for 'docker-ce' was not found [18:09:28] what [18:09:59] the puppet run can finish but it does not find the new version [18:10:26] these are buster [18:10:26] (03PS10) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 [18:10:35] it's a TODO to move them to bullseye and new hardware [18:10:41] is that why? [18:11:07] looking [18:11:08] i think it's because reprepro didn't pull in the latest versions for buster maybe https://phabricator.wikimedia.org/T318382#8271222 [18:11:18] only bullseye [18:11:46] we'll want those updated for buster as well... *sigh* [18:11:53] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup1002), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [18:12:12] maybe we should just not pin the version. sorry mutante. 
you can revert the patch and we're suss it out [18:12:18] we can get Version: 5:20.10.12~3-0~debian-buster [18:12:22] *we'll* [18:12:34] that's what's currently installed, yeah [18:13:48] reading the ticket link.. ACK [18:13:54] reverting for right now, ok [18:14:09] sorry about that. i didn't catch it in the comment [18:14:37] no problem [18:14:53] we have new hardware to replace contint* [18:15:04] let's use that to install bullseye [18:15:31] but a contint* server will probably have other stuff to solve for that [18:16:00] (03PS1) 10Dzahn: Revert "P:ci::docker: Upgrade docker to 20.10.18 on all CI agents" [puppet] - 10https://gerrit.wikimedia.org/r/842802 [18:16:04] yeah, that's a bigger task [18:16:31] https://phabricator.wikimedia.org/T294276 [18:16:50] but that's the perfect opportunity to upgrade distro [18:16:52] i think i'll just ask moritzm if he can pull in the newer packages for buster [18:16:54] because it means we have test hosts [18:16:59] without touching the prod CI [18:17:04] which you normally wouldnt have [18:17:27] yea, that too, for short term. +1 [18:17:39] (03CR) 10Bartosz Dziewoński: [C: 03+1] Beta: set $wmgVisualEditorAccessRestbaseDirectly = false for dewiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842858 (https://phabricator.wikimedia.org/T320703) (owner: 10Daniel Kinzler) [18:17:50] we at least have a newer _enough_ docker-ce now with the buildkit fixes [18:17:55] so that's good [18:17:58] :) [18:18:00] (03CR) 10Dzahn: [C: 03+2] Revert "P:ci::docker: Upgrade docker to 20.10.18 on all CI agents" [puppet] - 10https://gerrit.wikimedia.org/r/842802 (owner: 10Dzahn) [18:18:22] (03CR) 10Dzahn: [C: 03+2] "E: Version '5:20.10.18~3-0~debian-buster' for 'docker-ce' was not found" [puppet] - 10https://gerrit.wikimedia.org/r/842802 (owner: 10Dzahn) [18:18:50] how is the cloud part doing [18:19:02] since the change and revert edited cloud.yaml too [18:20:11] ok, puppet is happy on contint1001. 
I am re-enabling 2001 [18:20:26] well, so that's a little funny. we have the newer package version for cloud, but not the older one [18:20:37] heh:) [18:20:38] so i had to add a little project-level puppet to bump the version there [18:20:49] i was hoping to take that out as soon as we deployed this change :) [18:21:16] ok. from my side: done. noop on prod CI server the whole time [18:21:21] but the upgrade went fine. no problems with docker so far [18:21:32] puppet runs again as normal [18:21:35] thanks, mutante! i'll re-enable the agents [18:21:42] yw, yep [18:21:54] and sounds good about the upgrade [18:25:03] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:38:35] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:43:05] (03CR) 10Jbond: wmflib::ansi: add new ansi formatting function (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842496 (owner: 10Jbond) [18:44:11] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:45:13] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:46:17] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:47:11] (03PS1) 10Andrew Bogott: magnum: use rabbitmq_node rather than openstack_controller for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842862 (https://phabricator.wikimedia.org/T309407) [18:49:26] (03PS2) 
10Andrew Bogott: magnum: use rabbitmq_node rather than openstack_controller for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842862 (https://phabricator.wikimedia.org/T309407) [18:52:02] (03CR) 10Andrew Bogott: [C: 03+2] magnum: use rabbitmq_node rather than openstack_controller for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842862 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott) [19:00:27] (03PS1) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792) [19:01:45] (03PS1) 10Andrew Bogott: Add dummy rabbitmq passwords for Magnum [labs/private] - 10https://gerrit.wikimedia.org/r/842864 (https://phabricator.wikimedia.org/T280792) [19:04:28] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add dummy rabbitmq passwords for Magnum [labs/private] - 10https://gerrit.wikimedia.org/r/842864 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott) [19:06:21] (03PS2) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792) [19:08:28] I'll go and deploy the config change for beta now https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842858/ [19:09:57] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:18] (03PS3) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792) [19:10:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842858 (https://phabricator.wikimedia.org/T320703) (owner: 10Daniel 
Kinzler) [19:10:59] (03PS4) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792) [19:11:30] (03Merged) 10jenkins-bot: Beta: set $wmgVisualEditorAccessRestbaseDirectly = false for dewiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842858 (https://phabricator.wikimedia.org/T320703) (owner: 10Daniel Kinzler) [19:14:16] (03PS5) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792) [19:15:53] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:16:26] (03PS6) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the shared 'nova' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792) [19:19:07] (03CR) 10Andrew Bogott: [C: 03+2] Magnum: use magnum-specific rabbitmq user rather than the shared 'nova' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott) [19:26:42] (03PS1) 10Andrew Bogott: Add OpenStack Magnum to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842865 (https://phabricator.wikimedia.org/T280792) [19:28:29] (03PS2) 10Andrew Bogott: Add OpenStack Magnum to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842865 (https://phabricator.wikimedia.org/T280792) [19:31:45] (03PS1) 10Andrew Bogott: Fix name for dummy magnum rabbit password [labs/private] - 10https://gerrit.wikimedia.org/r/842866 [19:31:55] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:32:11] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Fix name for dummy magnum rabbit password [labs/private] - 10https://gerrit.wikimedia.org/r/842866 (owner: 10Andrew Bogott) [19:32:19] (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:03] (ProbeDown) firing: (10) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:12] * jhathaway here [19:35:59] (03PS3) 10Andrew Bogott: Add OpenStack Magnum to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842865 (https://phabricator.wikimedia.org/T280792) [19:37:19] (ProbeDown) resolved: (6) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:38:03] (ProbeDown) resolved: (14) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:38:23] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:39:18] (ProbeDown) firing: (5) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:39:41] (03PS4) 10Andrew Bogott: Add OpenStack Magnum to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842865 
(https://phabricator.wikimedia.org/T280792) [19:40:03] (ProbeDown) firing: (10) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:40:09] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:53] Hi, lists.wikimedia.org is down (request on / issues a 301 redirect, but /postorius/lists/ timeouts). Can someone bring it back please? [19:42:14] (03CR) 10Andrew Bogott: [C: 03+2] Add OpenStack Magnum to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842865 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott) [19:42:33] (ProbeDown) resolved: (6) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:44:28] urbanecm: works for me (now) [19:44:39] urbanecm: something else is going on [19:44:40] works for me now too! 
[19:45:03] (ProbeDown) resolved: (10) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:45:42] yea, so the jinxer-wm messages above [19:47:43] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:54:29] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:55:34] !log oblivian@cumin1001 START - Cookbook sre.network.cf [19:55:35] !log oblivian@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [19:57:44] !log oblivian@cumin1001 START - Cookbook sre.network.cf [19:57:45] !log oblivian@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [19:59:57] (03PS1) 10Andrew Bogott: Add haproxy entry for magnum on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842869 (https://phabricator.wikimedia.org/T280792) [20:02:22] (03CR) 10Andrew Bogott: [C: 03+2] Add haproxy entry for magnum on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842869 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott) [20:05:01] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [20:06:15] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:03] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:34:25] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:41:13] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:09] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:48:11] !log jhathaway@cumin1001 START - Cookbook sre.network.cf [20:48:12] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [20:48:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:53:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:53:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:55:01] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [21:00:59] PROBLEM - CirrusSearch eqiad 95th percentile 
latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [21:03:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [21:10:57] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/842398 [21:19:25] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:22:08] (03PS1) 10Dzahn: phabricator: rename rsync module for dumps [puppet] - 10https://gerrit.wikimedia.org/r/842873 (https://phabricator.wikimedia.org/T313360) [21:23:31] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [21:42:36] (03PS1) 10Dzahn: phabricator: move list of dumps rsync clients to parameter and Hiera [puppet] - 10https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360) [21:43:33] (03CR) 10CI reject: [V: 04-1] phabricator: move list of dumps rsync clients to parameter and Hiera [puppet] - 10https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [21:44:50] (03PS2) 10Dzahn: phabricator: move list of dumps rsync clients to parameter and Hiera [puppet] 
- 10https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360) [21:59:44] (03PS1) 10Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) [22:04:05] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:08:32] (03CR) 10Dzahn: [C: 03+2] phabricator: rename rsync module for dumps [puppet] - 10https://gerrit.wikimedia.org/r/842873 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [22:08:38] (03PS2) 10Dzahn: phabricator: rename rsync module for dumps [puppet] - 10https://gerrit.wikimedia.org/r/842873 (https://phabricator.wikimedia.org/T313360) [22:37:33] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:37:58] (03CR) 10Dzahn: "ah, right. manual cleanup not even needed. 
puppet does that (meanwhile)" [puppet] - 10https://gerrit.wikimedia.org/r/842873 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [22:37:59] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:38:21] (03PS3) 10Dzahn: phabricator: move list of dumps rsync clients to parameter and Hiera [puppet] - 10https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360) [22:41:55] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:46:07] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:49:19] (03PS1) 10Urbanecm: Mentee filters: always use mw.user.options values to initialise the mentees store [extensions/GrowthExperiments] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/842897 (https://phabricator.wikimedia.org/T320728) [22:56:47] !log pcc-worker1003.puppet-diffs.eqiad1.wikimedia.cloud - out of disk space again - deleted 3.5GB job "1460" to unblock puppet compiling [22:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:11] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:13:42] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37572/phab1004.eqiad.wmnet/index.html" [puppet] - 
10https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:18:27] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop everywhere, issues on phab1004 entirely unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:18:35] (03PS2) 10Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) [23:33:42] (03CR) 10Dzahn: [C: 04-1] "parameter 'dumps_rsync_clients' index 4 expects a match for Stdlib::Fqdn" [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:34:05] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:37:17] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:40:42] (03PS3) 10Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) [23:41:20] (03CR) 10CI reject: [V: 04-1] phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:41:52] (03PS4) 10Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) [23:42:16] (03PS5) 10Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) [23:42:55] (03CR) 10CI reject: [V: 04-1] phabricator: use anchor/alias to add phab servers to dump clients list 
[puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:44:05] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:15] (03CR) 10Dzahn: "still parameter 'dumps_rsync_clients' index 4 expects a match for Stdlib::Fqdn. is it my syntax or how can I use the anchor/alias and not " [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:46:37] (03PS6) 10Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) [23:48:15] (03PS7) 10Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) [23:48:49] (03CR) 10CI reject: [V: 04-1] phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - 10https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn)
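A note on the repeated "parameter 'dumps_rsync_clients' index 4 expects a match for Stdlib::Fqdn" failure above. The log does not show the Hiera data, so this is only one possible reading: in YAML, an alias to a sequence used as an element of another sequence is not spliced in; it becomes a nested list at that position. If the alias were the fifth element, a type check over a flat list of FQDNs would fail at index 4. The sketch below assumes hypothetical hostnames and uses a simplified regex as a rough stand-in for Stdlib::Fqdn (not its actual pattern); `parsed` is written by hand to mimic what a YAML parser would return.

```python
import re

# Assumption: a simplified FQDN pattern, standing in for Puppet's Stdlib::Fqdn.
FQDN_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$",
    re.IGNORECASE,
)


def first_invalid_index(clients):
    """Return the index of the first element that is not an FQDN string, or None."""
    for i, client in enumerate(clients):
        if not (isinstance(client, str) and FQDN_RE.match(client)):
            return i
    return None


def flatten_one_level(clients):
    """Splice nested lists into the parent list (one level only)."""
    flat = []
    for client in clients:
        if isinstance(client, list):
            flat.extend(client)
        else:
            flat.append(client)
    return flat


# Hypothetical data: the list an anchor such as `&phab_servers` might hold.
phab_servers = ["phab-a.example.org", "phab-b.example.org"]

# What a parser yields for four plain hosts followed by `- *phab_servers`:
# the alias stays a nested list in the fifth slot.
parsed = [
    "client1.example.org",
    "client2.example.org",
    "client3.example.org",
    "client4.example.org",
    phab_servers,
]

nested_failure = first_invalid_index(parsed)
flat_failure = first_invalid_index(flatten_one_level(parsed))
```

Under these assumptions `nested_failure` points at the slot holding the alias, while the flattened list passes the check. If that is what is happening in the patch, flattening on the Puppet side (stdlib provides a `flatten` function) or listing the hosts explicitly would be two ways around it; the log itself does not confirm the cause.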