[00:00:08] <wikibugs>	 (CR) Krinkle: [C: +1] "I reproduced this as follows:" [mediawiki-config] - https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[00:00:21] <wikibugs>	 (PS4) Tim Starling: Remove PHP 7.4 version check and prepare for title case [mediawiki-config] - https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552)
[00:00:23] <wikibugs>	 (PS4) Tim Starling: Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552)
[00:03:02] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:09:34] <wikibugs>	 (PS3) CDanis: Re-introduce newconnrate [puppet] - https://gerrit.wikimedia.org/r/842539 (https://phabricator.wikimedia.org/T306580)
[00:09:56] <wikibugs>	 (CR) CDanis: "PCC https://puppet-compiler.wmflabs.org/pcc-worker1001/37549/" [puppet] - https://gerrit.wikimedia.org/r/842539 (https://phabricator.wikimedia.org/T306580) (owner: CDanis)
[00:10:45] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:28:02] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:29:55] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2025-2027]
[00:32:53] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:33:02] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:33:39] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:33:57] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:34:56] <wikibugs>	 (PS1) Ryan Kemper: elastic: decom elastic20[25-36] [puppet] - https://gerrit.wikimedia.org/r/842547
[00:35:44] <wikibugs>	 (PS2) Ryan Kemper: elastic: decom elastic20[25-36] [puppet] - https://gerrit.wikimedia.org/r/842547
[00:36:36] <ryankemper>	 !log T300943 Decom'ing elastic20[25-36]. Decommissioning in batches by row, starting with row A (2025-27)
[00:36:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:41] <stashbot>	 T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943
[00:38:02] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:43:02] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:43:09] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:23] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=elastic2025*
[00:44:05] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[00:45:57] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=elastic2026.codfw.wmnet
[00:48:07] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:48:08] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic[2025-2027]
[00:49:55] <ryankemper>	 !log [Elastic] `ryankemper@elastic1083:~$ sudo systemctl restart elasticsearch_7*` to clear `CirrusSearchJVMGCYoungPoolInsufficient`
[00:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:40] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2028-2030]
[00:53:02] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[01:06:43] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:13:14] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[01:14:33] <wikibugs>	 (CR) Tim Starling: [C: +2] Remove PHP 7.4 version check and prepare for title case [mediawiki-config] - https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[01:15:26] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:15:27] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic[2028-2030]
[01:15:48] <wikibugs>	 (Merged) jenkins-bot: Remove PHP 7.4 version check and prepare for title case [mediawiki-config] - https://gerrit.wikimedia.org/r/842242 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[01:19:42] <wikibugs>	 ops-eqiad: eqaid: duplicate serial: - https://phabricator.wikimedia.org/T320772 (Papaul)
[01:20:48] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/UcfirstOverrides.php: for T292552, should have no effect at this stage (duration: 03m 46s)
[01:20:53] <stashbot>	 T292552: Rename articles and users to prepare for PHP 7.3 unicode changes - https://phabricator.wikimedia.org/T292552
[01:26:43] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 03m 36s)
[01:32:14] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2031-2033].codfw.wmnet
[01:35:04] <wikibugs>	 (PS3) Ryan Kemper: elastic: decom elastic20[25-36] [puppet] - https://gerrit.wikimedia.org/r/842547 (https://phabricator.wikimedia.org/T300943)
[01:37:06] <wikibugs>	 (PS4) Ryan Kemper: elastic: decom elastic20[25-36] [puppet] - https://gerrit.wikimedia.org/r/842547 (https://phabricator.wikimedia.org/T300943)
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:40:20] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[01:42:16] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:42:17] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic[2031-2033].codfw.wmnet
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:57] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:41] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:59:18] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2034,2036].codfw.wmnet
[02:01:12] <ryankemper>	 !log T300943 Final batch of decom'ing `elastic20[25-36]` => already decommissioned rows A/B/C; starting final row D (corresponding to `203[4,6]`)
[02:01:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:01:17] <stashbot>	 T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943
[02:05:17] <wikibugs>	 (CR) Ryan Kemper: [C: +2] elastic: decom elastic20[25-36] [puppet] - https://gerrit.wikimedia.org/r/842547 (https://phabricator.wikimedia.org/T300943) (owner: Ryan Kemper)
[02:05:50] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[02:07:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:58] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:10:59] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic[2034,2036].codfw.wmnet
[02:11:26] <ryankemper>	 !log T300943 Decom of elastic20[25-36] complete. Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/842547. This is done
[02:11:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:31] <stashbot>	 T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943
[02:12:00] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[02:13:20] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[02:24:41] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config: clean up deleted file (duration: 03m 46s)
[02:54:39] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:06:09] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 110 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:07:59] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:08:23] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 11 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:17:09] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:22:01] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 #page on db1143 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1306.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:30:51] <icinga-wm>	 PROBLEM - DNS on elastic2026.mgmt is CRITICAL: Domain elastic2026.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:30:51] <icinga-wm>	 PROBLEM - DNS on elastic2027.mgmt is CRITICAL: Domain elastic2027.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:30:51] <icinga-wm>	 PROBLEM - DNS on elastic2028.mgmt is CRITICAL: Domain elastic2028.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:30:51] <icinga-wm>	 PROBLEM - DNS on elastic2030.mgmt is CRITICAL: Domain elastic2030.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:30:51] <icinga-wm>	 PROBLEM - DNS on elastic2029.mgmt is CRITICAL: Domain elastic2029.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:35:21] <icinga-wm>	 PROBLEM - DNS on elastic2025.mgmt is CRITICAL: Domain elastic2025.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:42:23] <logmsgbot>	 !log oblivian@cumin1001 dbctl commit (dc=all): 'depool db1143, lagging', diff saved to https://phabricator.wikimedia.org/P35485 and previous config saved to /var/cache/conftool/dbconfig/20221014-034223-oblivian.json
[03:50:53] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:34:37] <icinga-wm>	 PROBLEM - Host elastic2025 is DOWN: PING CRITICAL - Packet loss = 100%
[04:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:38:23] <icinga-wm>	 PROBLEM - Host elastic2026 is DOWN: PING CRITICAL - Packet loss = 100%
[04:40:01] <icinga-wm>	 PROBLEM - Host elastic2027 is DOWN: PING CRITICAL - Packet loss = 100%
[04:46:55] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:53:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[05:01:15] <icinga-wm>	 PROBLEM - Host elastic2028 is DOWN: PING CRITICAL - Packet loss = 100%
[05:06:51] <icinga-wm>	 PROBLEM - Host elastic2029 is DOWN: PING CRITICAL - Packet loss = 100%
[05:09:25] <icinga-wm>	 PROBLEM - Host elastic2030 is DOWN: PING CRITICAL - Packet loss = 100%
[05:24:33] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:25:39] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:27:33] <_joe_>	 uh wat
[06:27:52] <_joe_>	 we actually lost half a row of ES in codfw?
[06:27:57] <_joe_>	 why isn't this alerting
[06:29:23] <_joe_>	 ah these are machines to decom apparently
[06:29:47] <_joe_>	 they're not in manifest/site.pp anymore, making it even more confusing
[06:37:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Not working well
[06:37:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Not working well
[06:43:10] <wikibugs>	 (CR) Ayounsi: [C: +1] "Assuming the linked changes get approved. The logic looks good to me, but I can't mentally interpret it and see what it would look like. M" [puppet] - https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: Jbond)
[06:45:27] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 7843
[06:46:04] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 7843
[06:52:09] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:58:51] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221014T0700)
[07:16:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1008.eqiad.wmnet with reason: Remove from cluster for eventual decom
[07:17:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1008.eqiad.wmnet with reason: Remove from cluster for eventual decom
[07:18:34] <wikibugs>	 (PS2) Muehlenhoff: Remove ganeti role from ganeti1008 [puppet] - https://gerrit.wikimedia.org/r/842510 (https://phabricator.wikimedia.org/T320419)
[07:21:23] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove ganeti role from ganeti1008 [puppet] - https://gerrit.wikimedia.org/r/842510 (https://phabricator.wikimedia.org/T320419) (owner: Muehlenhoff)
[07:24:05] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:25:21] <ryankemper>	 _joe_: those instances have been decom'd, it looks like 5 of them are still showing up in icinga though
[07:25:38] <_joe_>	 ryankemper: yeah hence my confusion
[07:29:11] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[07:30:40] <moritzm>	 they are still in puppetdb and debmonitor, so something must have been off with the run of the decom cookbook
[07:33:58] <moritzm>	 ryankemper: it seems when running the decom cookbook partially a botched query was submitted, I'm seeing "Query 'elastic20[28-30]' did not match any host or failed" in the logs
[07:34:23] <moritzm>	 so simply re-running the cookbook should fix it
[07:36:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1005.eqiad.wmnet
[07:36:42] <ryankemper>	 moritzm: thanks, and yeah I can see I ran it like `elastic20[28-30]` instead of `elastic20[28-30]*`
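The difference between the two queries above can be illustrated with a minimal Python sketch. This is not Cumin's implementation: the inventory list and the `expand` and `select` helpers are hypothetical names, and the sketch assumes, as the exchange above suggests, that hosts are matched against their full FQDN, so a numeric range without a trailing glob matches nothing.

```python
import fnmatch
import re

# Hypothetical inventory of FQDNs, standing in for whatever host database is queried.
INVENTORY = [
    "elastic2027.codfw.wmnet",
    "elastic2028.codfw.wmnet",
    "elastic2029.codfw.wmnet",
    "elastic2030.codfw.wmnet",
    "elastic2031.codfw.wmnet",
]

# Matches a single numeric range such as [28-30].
RANGE_RE = re.compile(r"\[(\d+)-(\d+)\]")


def expand(query):
    """Expand one [N-M] numeric range into one pattern per number."""
    match = RANGE_RE.search(query)
    if match is None:
        return [query]
    low, high = int(match.group(1)), int(match.group(2))
    width = len(match.group(1))  # keep zero padding, if any
    return [
        query[:match.start()] + str(n).zfill(width) + query[match.end():]
        for n in range(low, high + 1)
    ]


def select(query, inventory=INVENTORY):
    """Return inventory hosts whose full name matches any expanded pattern."""
    patterns = expand(query)
    return [
        host
        for host in inventory
        if any(fnmatch.fnmatchcase(host, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    print(select("elastic20[28-30]"))   # short names only, no FQDN match
    print(select("elastic20[28-30]*"))  # trailing glob covers the domain suffix
```

In this sketch the first query expands to bare short names, which never equal an FQDN, while the second expands to patterns ending in `*` that cover the `.codfw.wmnet` suffix.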
[07:37:31] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2025-2027].codfw.wmnet
[07:41:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:43:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:43:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1005.eqiad.wmnet
[07:44:36] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:45:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1006.eqiad.wmnet
[07:51:13] <wikibugs>	 (PS1) David Caro: ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693
[07:54:27] <wikibugs>	 (CR) CI reject: [V: -1] ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693 (owner: David Caro)
[07:54:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:55:32] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:56:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:56:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1006.eqiad.wmnet
[07:57:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1007.eqiad.wmnet
[08:02:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:04:45] <wikibugs>	 (PS2) David Caro: ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693
[08:05:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:05:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1007.eqiad.wmnet
[08:07:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1008.eqiad.wmnet
[08:12:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:14:10] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[08:15:20] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:15:21] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts elastic[2025-2027].codfw.wmnet
[08:15:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:15:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1008.eqiad.wmnet
[08:21:46] <wikibugs>	 (PS1) Muehlenhoff: Remove remaining Puppet references for ganeti1005-1008 [puppet] - https://gerrit.wikimedia.org/r/842694 (https://phabricator.wikimedia.org/T320419)
[08:26:28] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove remaining Puppet references for ganeti1005-1008 [puppet] - https://gerrit.wikimedia.org/r/842694 (https://phabricator.wikimedia.org/T320419) (owner: Muehlenhoff)
[08:29:02] <moritzm>	 !log installing git security updates on buster
[08:29:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:28] <wikibugs>	 SRE, Data Engineering Planning, Data Pipelines, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (EChetty)
[08:31:25] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2028-2030].codfw.wmnet
[08:32:51] <wikibugs>	 (CR) FNegri: [C: +1] ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693 (owner: David Caro)
[08:35:25] <wikibugs>	 ops-eqiad, decommission-hardware: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419 (MoritzMuehlenhoff) a:Jclark-ctr These are ready for DC ops unracking tasks.
[08:37:28] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:44:43] <wikibugs>	 (CR) David Caro: [C: +2] ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693 (owner: David Caro)
[08:46:05] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[08:46:35] <wikibugs>	 (CR) David Caro: [C: -1] alerts.downtime_host: attempt to match alert hostnames with :<port> (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/837132 (owner: Andrew Bogott)
[08:47:16] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:47:17] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts elastic[2028-2030].codfw.wmnet
[08:47:25] <wikibugs>	 (CR) David Caro: [C: +2] Revert "cloudbackups: run nfs backups from labstore1004 rather than 1005" [puppet] - https://gerrit.wikimedia.org/r/838090 (owner: David Caro)
[08:48:08] <wikibugs>	 (Merged) jenkins-bot: ceph: remove all not needed alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/842693 (owner: David Caro)
[08:53:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[09:14:25] <wikibugs>	 (PS9) Giuseppe Lavagetto: New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (https://phabricator.wikimedia.org/T320782)
[09:14:51] <wikibugs>	 (CR) CI reject: [V: -1] New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (https://phabricator.wikimedia.org/T320782) (owner: Giuseppe Lavagetto)
[09:22:06] <wikibugs>	 (PS1) Elukey: ml-services: update revscoring Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/842697 (https://phabricator.wikimedia.org/T320374)
[09:23:06] <wikibugs>	 SRE, Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (MatthewVernon) [[ https://en.wikipedia.org/wiki/Louhi | Louhi ]], the shape-changing witch-queen from the Kalevala? I don't think currently in use as software-name...
[09:27:05] <wikibugs>	 (CR) CI reject: [V: -1] ml-services: update revscoring Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/842697 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[09:27:18] <wikibugs>	 (CR) AikoChou: [C: +1] ml-services: update revscoring Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/842697 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[09:35:07] <wikibugs>	 (CR) Jbond: wmflib::ansi: add new ansi formatting function (1 comment) [puppet] - https://gerrit.wikimedia.org/r/842496 (owner: Jbond)
[09:35:20] <wikibugs>	 (PS4) Jbond: wmflib::ansi: add new ansi formatting function [puppet] - https://gerrit.wikimedia.org/r/842496
[09:36:17] <wikibugs>	 (PS6) Jbond: P:netbox::host: create a motd for the status [puppet] - https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696)
[09:42:36] <icinga-wm>	 PROBLEM - Dell PowerEdge RAID Controller on db1202 is CRITICAL: communication 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[09:42:37] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db1202 is CRITICAL: communication 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T320786 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[09:42:42] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (ops-monitoring-bot)
[09:45:52] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:53:03] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/842496 (owner: Jbond)
[09:53:22] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:56:44] <wikibugs>	 (CR) Jbond: P:netbox::host: create a motd for the status (3 comments) [puppet] - https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: Jbond)
[09:58:37] <wikibugs>	 (PS7) Jbond: P:netbox::host: create a motd for the status [puppet] - https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696)
[10:02:18] <wikibugs>	 SRE, Infrastructure-Foundations: Pick a name for the IDM - https://phabricator.wikimedia.org/T319409 (cmooney) Charon is also the StrongSwan IKEv2 daemon: https://docs.strongswan.org/docs/5.9/daemons/charon.html  >>! In T319409#8316323, @MatthewVernon wrote: > [[ https://en.wikipedia.org/wiki/Louhi | Lou...
[10:11:29] <wikibugs>	 SRE, observability, serviceops, Maps (Kartotherian): Get Kartotherian SLO metrics into Prometheus - https://phabricator.wikimedia.org/T320748 (fgiunchedi) With my Observability/Prometheus hat on: to bridge the statsd/prometheus gap we've been deploying `profile::prometheus::statsd_exporter` e.g....
[10:13:08] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:15:24] <dcausse>	 !log Deployed patch for T320785
[10:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:52] <wikibugs>	 (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/842697 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[10:19:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: aptrepo: add trailing newline to "updates" [puppet] - 10https://gerrit.wikimedia.org/r/842703
[10:20:18] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:35] <godog>	 if anyone is up for a trivial review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/842703/
[10:22:01] <godog>	 !log upgrade grafana to 8.5.14
[10:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:18] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:31:32] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/842703 (owner: 10Filippo Giunchedi)
[10:44:52] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:58:46] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:18:29] <wikibugs>	 (03CR) 10Muehlenhoff: "This is getting quite ready! I did another pass, but most of them are smaller nits/comments." [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede)
[11:20:20] <wikibugs>	 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Clement_Goubert) Just for confirmation before diving into it on Monday, the list of services to re-de...
[11:20:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/842498 (https://phabricator.wikimedia.org/T320696) (owner: 10Jbond)
[11:31:26] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:43:34] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) >>! In T319067#8314210, @BBlack wrote: > Is it possible to fake this out with a bunch of trivially-built empty udebs that are in our r...
[11:45:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: add trailing newline to "updates" [puppet] - 10https://gerrit.wikimedia.org/r/842703 (owner: 10Filippo Giunchedi)
[11:46:04] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) >>! In T319067#8314213, @ssingh wrote: > On the Traffic side, the image + cookbook patch is working for us. The only issue being -- an...
[11:53:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Peachey88)
[11:56:34] <wikibugs>	 (03PS1) 10Ladsgroup: db1143: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/842752 (https://phabricator.wikimedia.org/T320773)
[11:58:16] <wikibugs>	 (03PS1) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[11:58:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:01:34] <wikibugs>	 (03CR) 10Muehlenhoff: WIP: role::idm Basic deployment of IDM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:01:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1202 - Degraded RAID (T320786)', diff saved to https://phabricator.wikimedia.org/P35487 and previous config saved to /var/cache/conftool/dbconfig/20221014-120155-ladsgroup.json
[12:02:01] <stashbot>	 T320786: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786
[12:02:25] <wikibugs>	 (03PS2) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[12:02:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:04:59] <wikibugs>	 (03PS2) 10Ladsgroup: db1143: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/842752 (https://phabricator.wikimedia.org/T320773)
[12:05:06] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1143: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/842752 (https://phabricator.wikimedia.org/T320773) (owner: 10Ladsgroup)
[12:06:10] <wikibugs>	 (03PS3) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[12:06:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:06:46] <wikibugs>	 (03PS1) 10Ladsgroup: db1143: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/842754 (https://phabricator.wikimedia.org/T320786)
[12:07:14] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1143: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/842754 (https://phabricator.wikimedia.org/T320786) (owner: 10Ladsgroup)
[12:07:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10MoritzMuehlenhoff)
[12:08:27] <wikibugs>	 (03PS4) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[12:08:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Implement a staging setup - https://phabricator.wikimedia.org/T320795 (10MoritzMuehlenhoff)
[12:09:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:12:31] <wikibugs>	 (03PS5) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[12:13:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:13:19] <wikibugs>	 (03CR) 10Slyngshede: WIP: role::idm Basic deployment of IDM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:14:23] <wikibugs>	 (03PS6) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[12:14:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:19:26] <wikibugs>	 (03PS7) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[12:19:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Initial production deployment - https://phabricator.wikimedia.org/T320797 (10MoritzMuehlenhoff)
[12:21:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:23:34] <wikibugs>	 (03PS8) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[12:24:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:24:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations: IDM integration into CAS SSO - https://phabricator.wikimedia.org/T320799 (10MoritzMuehlenhoff)
[12:25:47] <wikibugs>	 (03PS9) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[12:26:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede)
[12:27:15] <wikibugs>	 (03PS10) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[12:27:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations: IDM milestone 3 "Build-out for self service" - https://phabricator.wikimedia.org/T320801 (10MoritzMuehlenhoff)
[12:28:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Create a mockup and involve designers - https://phabricator.wikimedia.org/T320802 (10MoritzMuehlenhoff)
[12:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:39:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Define the core attribute list managed in the IDM with all stakeholders - https://phabricator.wikimedia.org/T320805 (10MoritzMuehlenhoff)
[12:41:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Consider reusing some wiki data sources for signup/restrictions - https://phabricator.wikimedia.org/T320806 (10MoritzMuehlenhoff)
[12:42:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Implement OAuth account validation for linking an account to a wiki account - https://phabricator.wikimedia.org/T320807 (10MoritzMuehlenhoff)
[12:43:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Implement email address validation workflow - https://phabricator.wikimedia.org/T320808 (10MoritzMuehlenhoff)
[12:44:20] <wikibugs>	 (03PS1) 10Muehlenhoff: pontoon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842756 (https://phabricator.wikimedia.org/T308013)
[12:44:22] <wikibugs>	 (03PS1) 10Muehlenhoff: paws: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842757 (https://phabricator.wikimedia.org/T308013)
[12:44:24] <wikibugs>	 (03PS1) 10Muehlenhoff: wmcs::nfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842758 (https://phabricator.wikimedia.org/T308013)
[12:44:26] <wikibugs>	 (03PS1) 10Muehlenhoff: kafka: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842759 (https://phabricator.wikimedia.org/T308013)
[12:44:28] <wikibugs>	 (03PS1) 10Muehlenhoff: dumps::generation: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842760 (https://phabricator.wikimedia.org/T308013)
[12:44:30] <wikibugs>	 (03PS1) 10Muehlenhoff: kerberos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842761 (https://phabricator.wikimedia.org/T308013)
[12:44:32] <wikibugs>	 (03PS1) 10Muehlenhoff: idp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842762 (https://phabricator.wikimedia.org/T308013)
[12:44:34] <wikibugs>	 (03PS1) 10Muehlenhoff: statistics : Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842763 (https://phabricator.wikimedia.org/T308013)
[12:44:36] <wikibugs>	 (03PS1) 10Muehlenhoff: labs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842764 (https://phabricator.wikimedia.org/T308013)
[12:44:38] <wikibugs>	 (03PS1) 10Muehlenhoff: kubernetes: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842765 (https://phabricator.wikimedia.org/T308013)
[12:45:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Figure out a captcha option - https://phabricator.wikimedia.org/T320809 (10MoritzMuehlenhoff)
[12:47:37] <wikibugs>	 (03PS1) 10Filippo Giunchedi: debian: add packaging [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808
[12:53:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[13:01:37] <wikibugs>	 10SRE, 10ARM support: SRE Summit 2022 Outcome of Session "Adoption of aarch64 (aka arm64) in WMF production?" - https://phabricator.wikimedia.org/T320811 (10akosiaris)
[13:02:09] <wikibugs>	 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Ottomata) Correct!
[13:02:42] <wikibugs>	 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T319425 (10Papaul) 05Open→03Resolved a:03Papaul The interface is not configured and it is disabled
[13:04:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:05:02] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Papaul) a:05LSobanski→03MatthewVernon
[13:05:19] <jinxer-wm>	 (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:05:25] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:05:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page  - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[13:05:50] <jelto>	 around
[13:06:13] <_joe_>	 uh oh
[13:06:53] <jelto>	 head over to _security
[13:10:19] <jinxer-wm>	 (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:10:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page  - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[13:13:23] <wikibugs>	 10SRE, 10ARM support: SRE Summit 2022 Outcome of Session "Adoption of aarch64 (aka arm64) in WMF production?" - https://phabricator.wikimedia.org/T320811 (10akosiaris)
[13:14:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[13:18:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] wmcs::nfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842758 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:18:46] <wikibugs>	 (03PS11) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[13:19:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] kerberos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842761 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:19:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[13:23:11] <wikibugs>	 (03PS12) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428)
[13:28:13] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "looks good, have you considered using http_fail_rate to detect that the upper layers are struggling?" [puppet] - 10https://gerrit.wikimedia.org/r/842539 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[13:34:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:37:07] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:39:33] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2022-10-14-080318-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/842812 (https://phabricator.wikimedia.org/T319175)
[13:43:51] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:09] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/842758 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:45:15] <icinga-wm>	 RECOVERY - Host mw1314.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[13:46:05] <wikibugs>	 (03CR) 10Elukey: "This is awesome, thanks so much for doing it! I left a comment for the patch file, just to get the purpose of the preamble, the rest looks" [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 (owner: 10Filippo Giunchedi)
[13:48:20] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419 (10Jclark-ctr)
[13:49:11] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T319682 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr
[13:49:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update revscoring Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/842697 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[13:50:21] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:51:46] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudnet1004.eqiad.wmnet - https://phabricator.wikimedia.org/T319683 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr
[13:53:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Two nits, looks good otherwise." [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 (owner: 10Filippo Giunchedi)
[13:55:39] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:57:12] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:57:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:58:01] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission ganeti1005/ganeti1006/ganeti1007/ganeti1008 - https://phabricator.wikimedia.org/T320419 (10Jclark-ctr) 05Open→03Resolved completed Decom process
[13:58:30] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T319682 (10Jclark-ctr) 05Open→03Resolved completed Decom
[13:58:51] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudnet1004.eqiad.wmnet - https://phabricator.wikimedia.org/T319683 (10Jclark-ctr) 05Open→03Resolved completed Decom process
[13:59:03] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] labs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842764 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:59:20] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:00:00] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr
[14:00:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Jclark-ctr)
[14:00:25] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Jclark-ctr) 05Open→03Resolved Finished Decom process
[14:06:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:09:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:09:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:11:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:13:55] <wikibugs>	 (03PS2) 10Muehlenhoff: labs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842764 (https://phabricator.wikimedia.org/T308013)
[14:13:57] <elukey>	 this is me, we are working on this log spam from k8s :(
[14:14:10] <elukey>	 (the kafka logging too many msg etc..)
[14:16:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:17:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[14:18:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] labs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842764 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[14:18:59] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[28-39].eqiad.wmnet - https://phabricator.wikimedia.org/T318691 (10Jclark-ctr)
[14:19:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[14:21:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[14:21:43] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "Thanks, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/842759 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[14:22:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[14:22:39] <wikibugs>	 (03PS1) 10Clément Goubert: Remove references to deprecated kubeyaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348)
[14:24:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842756 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[14:24:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:25:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] pontoon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842756 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[14:27:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[14:27:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[14:27:20] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[14:27:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:28:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:29:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:29:11] <icinga-wm>	 PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:29:18] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:29:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[14:29:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:30:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:31:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[14:31:47] <icinga-wm>	 PROBLEM - Host mw1314.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:32:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[14:32:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:35:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:37:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] idp: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842762 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[14:37:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:40:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[14:40:49] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] admin_ng: set higher circuit breaking limits for EventGate on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/842494 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[14:42:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[14:42:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:42:59] <wikibugs>	 (03PS2) 10Filippo Giunchedi: debian: add packaging [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808
[14:43:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the quick reviews!" [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 (owner: 10Filippo Giunchedi)
[14:43:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[14:45:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:47:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[14:47:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[14:48:23] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[14:49:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:54:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:55:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Ship it!" [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 (owner: 10Filippo Giunchedi)
[14:55:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:12:52] <wikibugs>	 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T320817 (10phaultfinder)
[15:13:14] <wikibugs>	 (03PS1) 10Elukey: knative-serving: reduce the default logging levels [deployment-charts] - 10https://gerrit.wikimedia.org/r/842829 (https://phabricator.wikimedia.org/T320468)
[15:14:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] knative-serving: reduce the default logging levels [deployment-charts] - 10https://gerrit.wikimedia.org/r/842829 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey)
[15:17:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10dcaro) > But still fairly comfortably within the 10G NIC capacity. What throughput limits were hit? Sorry if I missed them on the dashboard you linked, I d...
[15:23:23] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/842394
[15:31:54] <wikibugs>	 (03PS2) 10Elukey: knative-serving: reduce the default logging levels [deployment-charts] - 10https://gerrit.wikimedia.org/r/842829 (https://phabricator.wikimedia.org/T320468)
[15:32:27] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/842395
[15:35:50] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] knative-serving: reduce the default logging levels [deployment-charts] - 10https://gerrit.wikimedia.org/r/842829 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey)
[15:37:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: reduce the default logging levels [deployment-charts] - 10https://gerrit.wikimedia.org/r/842829 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey)
[15:40:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:40:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:43:29] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:44:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:45:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[15:46:23] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[15:48:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[15:48:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:49:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[15:52:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) > I did some tests in the past and that was more or less the maximum network throughput I got, so I was expecting for that to be the same (thinki...
[15:53:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:14:53] <icinga-wm>	 PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[16:16:03] <icinga-wm>	 PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:09] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:18:01] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:21:41] <wikibugs>	 (03PS1) 10Cwhite: logstash: set webrequest index replicas to 0 for large indexes [puppet] - 10https://gerrit.wikimedia.org/r/842396 (https://phabricator.wikimedia.org/T313099)
[16:27:19] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:27:56] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "I'd like to deploy this before the next curator run at Oct 15 00:42 UTC." [puppet] - 10https://gerrit.wikimedia.org/r/842396 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite)
[16:30:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:30:42] <wikibugs>	 (03PS2) 10Cwhite: logstash: set webrequest index replicas to 0 for large indexes [puppet] - 10https://gerrit.wikimedia.org/r/842396 (https://phabricator.wikimedia.org/T313099)
[16:35:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:37:23] <icinga-wm>	 RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV
[16:38:01] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:38:35] <icinga-wm>	 RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[16:55:23] <wikibugs>	 (03PS1) 10Jbond: puppetdb: create small script to query puppetdb and give a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850
[16:55:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetdb: create small script to query puppetdb and give a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond)
[16:57:18] <wikibugs>	 (03CR) 10JHathaway: wmflib::ansi: add new ansi formatting function (2 comments) [puppet] - 10https://gerrit.wikimedia.org/r/842496 (owner: 10Jbond)
[16:59:37] <wikibugs>	 (03PS2) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850
[17:00:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond)
[17:01:36] <wikibugs>	 (03PS3) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850
[17:02:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond)
[17:03:00] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[17:06:17] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: set webrequest index replicas to 0 for large indexes [puppet] - 10https://gerrit.wikimedia.org/r/842396 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite)
[17:07:17] <icinga-wm>	 PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[17:08:25] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:11:49] <icinga-wm>	 RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[17:15:07] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:16:06] <wikibugs>	 (03PS4) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850
[17:16:08] <wikibugs>	 (03PS1) 10Jbond: P:puppetdb: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/842854
[17:18:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond)
[17:18:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:puppetdb: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/842854 (owner: 10Jbond)
[17:19:57] <wikibugs>	 (03PS5) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850
[17:23:32] <wikibugs>	 (03PS6) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850
[17:23:53] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:32:28] <wikibugs>	 (03PS1) 10Brennen Bearnes: gitlab runner: allow golang:* images [puppet] - 10https://gerrit.wikimedia.org/r/842857 (https://phabricator.wikimedia.org/T320825)
[17:33:41] <wikibugs>	 (03PS2) 10Brennen Bearnes: gitlab runner: allow golang:* images [puppet] - 10https://gerrit.wikimedia.org/r/842857 (https://phabricator.wikimedia.org/T320825)
[17:37:00] <wikibugs>	 (03CR) 10Addshore: [C: 03+1] gitlab runner: allow golang:* images [puppet] - 10https://gerrit.wikimedia.org/r/842857 (https://phabricator.wikimedia.org/T320825) (owner: 10Brennen Bearnes)
[17:43:58] <wikibugs>	 (03PS7) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850
[17:46:15] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: set webrequest index replicas to 0 for large indexes [puppet] - 10https://gerrit.wikimedia.org/r/842396 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite)
[17:47:54] <duesen>	 I'd like to deploy a config patch for beta in a bit, in an hour or so. 
[17:48:48] <duesen>	 I know it's Friday and all that... the change would allow me to test a change to VE that will be riding the train next week. Would be good if I could check it out on beta before the deployment branch is cut.
[17:48:52] <duesen>	 Any objections?
[17:51:36] <wikibugs>	 (03PS8) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850
[17:55:35] <wikibugs>	 (03PS1) 10Daniel Kinzler: Beta: set $wmgVisualEditorAccessRestbaseDirectly = false for dewiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842858 (https://phabricator.wikimedia.org/T320703)
[17:55:44] <duesen>	 This one --^
[17:56:14] <dancy>	 beta-only so no objection from me.
[17:58:04] <duesen>	 Great! I'll have dinner and then do it when my blood sugar is back to normal :)
[18:01:36] <wikibugs>	 (03CR) 10D3r1ck01: [C: 03+1] Beta: set $wmgVisualEditorAccessRestbaseDirectly = false for dewiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842858 (https://phabricator.wikimedia.org/T320703) (owner: 10Daniel Kinzler)
[18:03:58] <dduvall>	 mutante: o/ contint agents are offline and ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/834400
[18:04:10] <dduvall>	 i've logged in -releng
[18:05:21] <dduvall>	 duesen: fine by me too :)
[18:05:37] <mutante>	 compiling the change. How about we disable puppet on contint*, then deploy to non-active one.. then to active one
[18:05:39] <dduvall>	 though you might want to get an a-ok from an sre as well
[18:06:16] <dduvall>	 mutante: that sounds good
[18:06:38] <mutante>	 confirms that 2001 is master
[18:06:40] <wikibugs>	 (03PS9) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850
[18:07:43] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37554/contint2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall)
[18:08:08] <mutante>	 !log contint* - temp disabled puppet, deploying gerrit:834400, docker version upgrade on CI servers (T318382)
[18:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:13] <stashbot>	 T318382: Upgrade docker on integration hosts for fixes to BuildKit builder - https://phabricator.wikimedia.org/T318382
[18:08:58] <mutante>	 merged, running puppet on contint1001.. disabled on 2001
[18:09:17] <mutante>	 and.. it fails 
[18:09:22] <mutante>	 E: Version '5:20.10.18~3-0~debian-buster' for 'docker-ce' was not found
[18:09:28] <dduvall>	 what
[18:09:59] <mutante>	 the puppet run can finish but it does not find the new version
[18:10:26] <mutante>	 these are buster
[18:10:26] <wikibugs>	 (03PS10) 10Jbond: puppetdb: create small script to query puppetdb for a list of changes [puppet] - 10https://gerrit.wikimedia.org/r/842850
[18:10:35] <mutante>	 it's a TODO to move them to bullseye and new hardware
[18:10:41] <mutante>	 is that why?
[18:11:07] <mutante>	 looking
[18:11:08] <dduvall>	 i think it's because reprepro didn't pull in the latest versions for buster maybe https://phabricator.wikimedia.org/T318382#8271222
[18:11:18] <dduvall>	 only bullseye
[18:11:46] <dduvall>	 we'll want those updated for buster as well... *sigh*
[18:11:53] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup1002), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[18:12:12] <dduvall>	 maybe we should just not pin the version. sorry mutante. you can revert the patch and we're suss it out
[18:12:18] <mutante>	 we can get Version: 5:20.10.12~3-0~debian-buster
[18:12:22] <dduvall>	 *we'll*
[18:12:34] <dduvall>	 that's what's currently installed, yeah
[18:13:48] <mutante>	 reading the ticket link.. ACK
[18:13:54] <mutante>	 reverting for right now, ok
[18:14:09] <dduvall>	 sorry about that. i didn't catch it in the comment
[18:14:37] <mutante>	 no problem
[18:14:53] <mutante>	 we have new hardware to replace contint*
[18:15:04] <mutante>	 let's use that to install bullseye
[18:15:31] <mutante>	 but a contint* server will probably have other stuff to solve for that
[18:16:00] <wikibugs>	 (03PS1) 10Dzahn: Revert "P:ci::docker: Upgrade docker to 20.10.18 on all CI agents" [puppet] - 10https://gerrit.wikimedia.org/r/842802
[18:16:04] <dduvall>	 yeah, that's a bigger task
[18:16:31] <mutante>	 https://phabricator.wikimedia.org/T294276
[18:16:50] <mutante>	 but that's the perfect opportunity to upgrade distro
[18:16:52] <dduvall>	 i think i'll just ask moritzm if he can pull in the newer packages for buster
[18:16:54] <mutante>	 because it means we have test hosts
[18:16:59] <mutante>	 without touching the prod CI
[18:17:04] <mutante>	 which you normally wouldn't have
[18:17:27] <mutante>	 yea, that too, for short term. +1
[18:17:39] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 03+1] Beta: set $wmgVisualEditorAccessRestbaseDirectly = false for dewiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842858 (https://phabricator.wikimedia.org/T320703) (owner: 10Daniel Kinzler)
[18:17:50] <dduvall>	 we at least have a newer _enough_ docker-ce now with the buildkit fixes
[18:17:55] <dduvall>	 so that's good
[18:17:58] <mutante>	 :)
[18:18:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "P:ci::docker: Upgrade docker to 20.10.18 on all CI agents" [puppet] - 10https://gerrit.wikimedia.org/r/842802 (owner: 10Dzahn)
[18:18:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "E: Version '5:20.10.18~3-0~debian-buster' for 'docker-ce' was not found" [puppet] - 10https://gerrit.wikimedia.org/r/842802 (owner: 10Dzahn)
[18:18:50] <mutante>	 how is the cloud part doing
[18:19:02] <mutante>	 since the change and revert edited cloud.yaml too
[18:20:11] <mutante>	 ok, puppet is happy on contint1001. I am re-enabling 2001
[18:20:26] <dduvall>	 well, so that's a little funny. we have the newer package version for cloud, but not the older one
[18:20:37] <mutante>	 heh:)
[18:20:38] <dduvall>	 so i had to add a little project-level puppet to bump the version there
[18:20:49] <dduvall>	 i was hoping to take that out as soon as we deployed this change :)
[18:21:16] <mutante>	 ok. from my side: done. noop on prod CI server the whole time
[18:21:21] <dduvall>	 but the upgrade went fine. no problems with docker so far
[18:21:32] <mutante>	 puppet runs again as normal
[18:21:35] <dduvall>	 thanks, mutante! i'll re-enable the agents
[18:21:42] <mutante>	 yw, yep
[18:21:54] <mutante>	 and sounds good about the upgrade
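
The failure discussed above (a pinned docker-ce version that the apt repository does not carry for buster) is the kind of thing that can be checked before merging a pin. What follows is only a minimal sketch, not what was run on the contint hosts. It assumes the pipe-separated `package | version | source` layout printed by `apt-cache madison`; the helper names and the sample repository URL are hypothetical.

```python
def available_versions(madison_output, package):
    """Collect the versions listed for `package` in `apt-cache madison`-style text."""
    versions = set()
    for line in madison_output.splitlines():
        # Each line looks like: " pkg | version | source"
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and fields[0] == package:
            versions.add(fields[1])
    return versions


def pin_is_installable(pin, madison_output, package):
    """Return True only if the exact pinned version string is offered by the repo."""
    return pin in available_versions(madison_output, package)


# Hypothetical sample: the repo only carries the older buster build.
SAMPLE = (
    " docker-ce | 5:20.10.12~3-0~debian-buster"
    " | http://apt.example.org/debian buster/main amd64 Packages\n"
)

if __name__ == "__main__":
    for pin in ("5:20.10.12~3-0~debian-buster", "5:20.10.18~3-0~debian-buster"):
        print(pin, pin_is_installable(pin, SAMPLE, "docker-ce"))
```

The comparison is an exact string match on purpose: apt rejects a pin such as `5:20.10.18~3-0~debian-buster` unless that precise version exists for the host's distribution, even when a bullseye build of the same upstream release is available.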
[18:25:03] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:38:35] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:43:05] <wikibugs>	 (03CR) 10Jbond: wmflib::ansi: add new ansi formatting function (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842496 (owner: 10Jbond)
[18:44:11] <icinga-wm>	 PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[18:45:13] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:46:17] <icinga-wm>	 RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[18:47:11] <wikibugs>	 (03PS1) 10Andrew Bogott: magnum: use rabbitmq_node rather than openstack_controller for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842862 (https://phabricator.wikimedia.org/T309407)
[18:49:26] <wikibugs>	 (03PS2) 10Andrew Bogott: magnum: use rabbitmq_node rather than openstack_controller for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842862 (https://phabricator.wikimedia.org/T309407)
[18:52:02] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] magnum: use rabbitmq_node rather than openstack_controller for rabbit [puppet] - 10https://gerrit.wikimedia.org/r/842862 (https://phabricator.wikimedia.org/T309407) (owner: 10Andrew Bogott)
[19:00:27] <wikibugs>	 (03PS1) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792)
[19:01:45] <wikibugs>	 (03PS1) 10Andrew Bogott: Add dummy rabbitmq passwords for Magnum [labs/private] - 10https://gerrit.wikimedia.org/r/842864 (https://phabricator.wikimedia.org/T280792)
[19:04:28] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add dummy rabbitmq passwords for Magnum [labs/private] - 10https://gerrit.wikimedia.org/r/842864 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott)
[19:06:21] <wikibugs>	 (03PS2) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792)
[19:08:28] <duesen>	 I'll go and deploy the config change for beta now https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842858/ 
[19:09:57] <icinga-wm>	 PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:10:18] <wikibugs>	 (03PS3) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792)
[19:10:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842858 (https://phabricator.wikimedia.org/T320703) (owner: 10Daniel Kinzler)
[19:10:59] <wikibugs>	 (03PS4) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792)
[19:11:30] <wikibugs>	 (03Merged) 10jenkins-bot: Beta: set $wmgVisualEditorAccessRestbaseDirectly = false for dewiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842858 (https://phabricator.wikimedia.org/T320703) (owner: 10Daniel Kinzler)
[19:14:16] <wikibugs>	 (03PS5) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792)
[19:15:53] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:16:26] <wikibugs>	 (03PS6) 10Andrew Bogott: Magnum: use magnum-specific rabbitmq user rather than the shared 'nova' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792)
[19:19:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Magnum: use magnum-specific rabbitmq user rather than the shared 'nova' [puppet] - 10https://gerrit.wikimedia.org/r/842863 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott)
[19:26:42] <wikibugs>	 (03PS1) 10Andrew Bogott: Add OpenStack Magnum to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842865 (https://phabricator.wikimedia.org/T280792)
[19:28:29] <wikibugs>	 (03PS2) 10Andrew Bogott: Add OpenStack Magnum to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842865 (https://phabricator.wikimedia.org/T280792)
[19:31:45] <wikibugs>	 (03PS1) 10Andrew Bogott: Fix name for dummy magnum rabbit password [labs/private] - 10https://gerrit.wikimedia.org/r/842866
[19:31:55] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:32:11] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Fix name for dummy magnum rabbit password [labs/private] - 10https://gerrit.wikimedia.org/r/842866 (owner: 10Andrew Bogott)
[19:32:19] <jinxer-wm>	 (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:33:03] <jinxer-wm>	 (ProbeDown) firing: (10) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:33:12] * jhathaway here
[19:35:59] <wikibugs>	 (03PS3) 10Andrew Bogott: Add OpenStack Magnum to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842865 (https://phabricator.wikimedia.org/T280792)
[19:37:19] <jinxer-wm>	 (ProbeDown) resolved: (6) Service text-https:443 has failed probes (http_text-https_ip6) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:38:03] <jinxer-wm>	 (ProbeDown) resolved: (14) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:38:23] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:39:18] <jinxer-wm>	 (ProbeDown) firing: (5) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:39:41] <wikibugs>	 (03PS4) 10Andrew Bogott: Add OpenStack Magnum to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842865 (https://phabricator.wikimedia.org/T280792)
[19:40:03] <jinxer-wm>	 (ProbeDown) firing: (10) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:40:09] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:40:53] <urbanecm>	 Hi, lists.wikimedia.org is down (request on / issues a 301 redirect, but /postorius/lists/ times out). Can someone bring it back please?
[19:42:14] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add OpenStack Magnum to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842865 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott)
[19:42:33] <jinxer-wm>	 (ProbeDown) resolved: (6) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:44:28] <mutante>	 urbanecm: works for me (now)
[19:44:39] <mutante>	 urbanecm: something else is going on
[19:44:40] <urbanecm>	 works for me now too!
[19:45:03] <jinxer-wm>	 (ProbeDown) resolved: (10) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:45:42] <mutante>	 yea, so the jinxer-wm messages above 
[19:47:43] <icinga-wm>	 PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:54:29] <icinga-wm>	 RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:55:34] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.network.cf
[19:55:35] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[19:57:44] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.network.cf
[19:57:45] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[19:59:57] <wikibugs>	 (03PS1) 10Andrew Bogott: Add haproxy entry for magnum on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842869 (https://phabricator.wikimedia.org/T280792)
[20:02:22] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add haproxy entry for magnum on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/842869 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott)
[20:05:01] <jinxer-wm>	 (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[20:06:15] <icinga-wm>	 RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:33:03] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:34:25] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:41:13] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:42:09] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:48:11] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.network.cf
[20:48:12] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[20:48:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:53:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[20:53:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:55:01] <jinxer-wm>	 (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[21:00:59] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:03:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[21:10:57] <wikibugs>	 (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/842398
[21:19:25] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:22:08] <wikibugs>	 (PS1) Dzahn: phabricator: rename rsync module for dumps [puppet] - https://gerrit.wikimedia.org/r/842873 (https://phabricator.wikimedia.org/T313360)
[21:23:31] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:42:36] <wikibugs>	 (PS1) Dzahn: phabricator: move list of dumps rsync clients to parameter and Hiera [puppet] - https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360)
[21:43:33] <wikibugs>	 (CR) CI reject: [V: -1] phabricator: move list of dumps rsync clients to parameter and Hiera [puppet] - https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[21:44:50] <wikibugs>	 (PS2) Dzahn: phabricator: move list of dumps rsync clients to parameter and Hiera [puppet] - https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360)
[21:59:44] <wikibugs>	 (PS1) Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360)
[22:04:05] <icinga-wm>	 PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:08:32] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: rename rsync module for dumps [puppet] - https://gerrit.wikimedia.org/r/842873 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[22:08:38] <wikibugs>	 (PS2) Dzahn: phabricator: rename rsync module for dumps [puppet] - https://gerrit.wikimedia.org/r/842873 (https://phabricator.wikimedia.org/T313360)
[22:37:33] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:37:58] <wikibugs>	 (CR) Dzahn: "ah, right. manual cleanup not even needed. puppet does that (meanwhile)" [puppet] - https://gerrit.wikimedia.org/r/842873 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[22:37:59] <icinga-wm>	 PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:38:21] <wikibugs>	 (PS3) Dzahn: phabricator: move list of dumps rsync clients to parameter and Hiera [puppet] - https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360)
[22:41:55] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:46:07] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:49:19] <wikibugs>	 (PS1) Urbanecm: Mentee filters: always use mw.user.options values to initialise the mentees store [extensions/GrowthExperiments] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/842897 (https://phabricator.wikimedia.org/T320728)
[22:56:47] <mutante>	 !log pcc-worker1003.puppet-diffs.eqiad1.wikimedia.cloud - out of disk space again - deleted 3.5GB job "1460" to unblock puppet compiling
[22:56:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:11] <icinga-wm>	 RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:13:42] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37572/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[23:18:27] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "noop everywhere, issues on phab1004 entirely unrelated" [puppet] - https://gerrit.wikimedia.org/r/842875 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[23:18:35] <wikibugs>	 (PS2) Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360)
[23:33:42] <wikibugs>	 (CR) Dzahn: [C: -1] "parameter 'dumps_rsync_clients' index 4 expects a match for Stdlib::Fqdn" [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[23:34:05] <icinga-wm>	 RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:37:17] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:40:42] <wikibugs>	 (PS3) Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360)
[23:41:20] <wikibugs>	 (CR) CI reject: [V: -1] phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[23:41:52] <wikibugs>	 (PS4) Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360)
[23:42:16] <wikibugs>	 (PS5) Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360)
[23:42:55] <wikibugs>	 (CR) CI reject: [V: -1] phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[23:44:05] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:46:15] <wikibugs>	 (CR) Dzahn: "still parameter 'dumps_rsync_clients' index 4 expects a match for Stdlib::Fqdn. is it my syntax or how can I use the anchor/alias and not " [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)
[23:46:37] <wikibugs>	 (PS6) Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360)
[23:48:15] <wikibugs>	 (PS7) Dzahn: phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360)
[23:48:49] <wikibugs>	 (CR) CI reject: [V: -1] phabricator: use anchor/alias to add phab servers to dump clients list [puppet] - https://gerrit.wikimedia.org/r/842878 (https://phabricator.wikimedia.org/T313360) (owner: Dzahn)