[00:00:30] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:56] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:38] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:19:16] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:06] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:04] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:00:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:31] <wikibugs>	 10Data-Engineering: Add Active editors by country for Wikidata to stats.wikimedia.org - https://phabricator.wikimedia.org/T328999 (10Lectrician1)
[01:02:08] <wikibugs>	 10Data-Engineering: Add Active editors by country for Wikidata to stats.wikimedia.org - https://phabricator.wikimedia.org/T328999 (10Lectrician1) Related: T266643 T265510
[01:05:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:16] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:45:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:38] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:27:29] <wikibugs>	 (03PS1) 10Chad: Drop vestiges of git-fat [analytics/hdfs-tools/deploy] - 10https://gerrit.wikimedia.org/r/887000 (https://phabricator.wikimedia.org/T328473)
[03:30:36] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:00] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:30:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:30] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:20:20] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:18] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:46] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:35:24] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:06:04] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:20:28] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:29] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:29] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:44] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) Staging the new version on the switches: `asw-a-codfw> request system software add force-host set [ /var/tmp/jinstall-ex-...
[08:45:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:55] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:49] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:53:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[08:54:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2028%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[08:58:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[08:58:42] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[08:59:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:03:42] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:04:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (4) varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:09:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (6) varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:09:57] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp5019 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:11:42] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:13:42] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:14:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (6) varnishkafka on cp2032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:15:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:15:57] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (4) varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:16:42] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:17:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp6009 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6009%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:18:43] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (10gmodena) I archived the [mediawiki-stream-enrichment](https://gitlab.wikimedia.org/repos/data-e...
[09:19:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp6010 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6010%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:19:57] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:20:47] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:57] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:27:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp6012 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6012%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:27:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (6) varnishkafka on cp2041 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:31:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp6002 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:31:56] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (6) varnishkafka on cp2041 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:32:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (6) varnishkafka on cp5022 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:34:58] <wikibugs>	 10Quarry: GoogleDocs bot has download 125 000 csv exports in the last month - https://phabricator.wikimedia.org/T197256 (10taavi)
[09:35:56] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp5031 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:36:41] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (5) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:37:41] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp3050 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:40:41] <wikibugs>	 10Quarry, 10cloud-services-team (FY2022/2023-Q3): Consider moving Quarry to be an installation of Redash - https://phabricator.wikimedia.org/T169452 (10taavi)
[09:41:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp3054 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp3054%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:41:41] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:41:42] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:41:42] <wikibugs>	 10Data-Engineering, 10Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (10KCVelaga_WMF) @JAnstee_WMF  I realized that we are already considering growth, however, the column title is slightly confusing  calculation for growth in SQL query  ` connectivity_index / lag(connectivit...
[09:41:56] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:42:56] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10elukey) Created some docs to implement and test the new stream in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Streams_...
[09:42:57] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp3050 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:45:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:56] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (5) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:47:42] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp1077 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:47:42] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3051%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:47:56] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:47:57] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:48:42] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp3055 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:49:53] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:56] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (7) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:53:42] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (8) varnishkafka on cp1084 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:56:00] <elukey>	 hi folks!
[09:56:13] <btullis>	 Hi elukey :-)
[09:56:22] <elukey>	 filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/887285 to install a couple of packages to stat100x boxes (should be hopefully for a limited amount of time)
[09:56:56] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (7) varnishkafka on cp1084 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:57:42] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (8) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:59:42] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp1084 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:00:21] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:48] <btullis>	 elukey: Seems fine to me, but is it worth checking this? From here: https://packages.debian.org/bullseye/ocl-icd-libopencl1
[10:00:48] <btullis>	 > This package contains an installable client driver loader (ICD Loader) library that can be used to load any (free or non-free) installable client driver (ICD) for OpenCL.
[10:01:01] <btullis>	 Will we be using any non-free ICDs?
[10:01:56] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (8) varnishkafka on cp1084 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:02:42] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp3060 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:02:56] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (9) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:03:42] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1081%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:04:37] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:04:42] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (7) varnishkafka on cp1084 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:05:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:56] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (8) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:08:42] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (7) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:15:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:17:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3051%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:19:24] <elukey>	 btullis: sorry just seen the ping, not that I know 
[10:20:04] <elukey>	 but I'll ask the content translation team to check, thanks for the reference :)
[10:21:56] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3051%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:24:47] <wikibugs>	 10Analytics-Radar, 10Data-Services: Discuss labsdb visibility of rev_text_id and ar_comment - https://phabricator.wikimedia.org/T158166 (10taavi)
[10:29:27] <elukey>	 btullis: ah snap, for the test the team just realized that we'd need py3.9. Any plans to upgrade the stat100[5,8] nodes any time soon?
[10:38:05] <elukey>	 (to bullseye I meant)
[10:41:39] <btullis>	 Yes, definite plans. No firm dates yet. stat1010 is installed with bullseye, but still `insetup::data_engineering` - Maybe I can push through a change to put this into service quickly?
[10:42:14] <elukey>	 btullis: stat1009 is also on bullseye so it is ok for cpu-only tests, but it doesn't have the GPU :(
[10:42:28] <elukey>	 this is why I was asking for 1008/1005
[10:43:03] <btullis>	 Oh yeah, sorry. Forgot. OK, time to prioritise the upgrade then, I suppose. We have planning later today and we have 3 SREs on the team now :-)
[10:43:28] <elukey>	 \o/
[10:43:29] <elukey>	 thanks a lot
[10:45:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:29] <btullis>	 elukey: Will `conda install python=3.9` work for this requirement in the short term?
[10:47:33] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:48:55] <elukey>	 btullis: ah wait I didn't think about it, super ignorant about conda.. is it so magical?
[10:50:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:52:22] <btullis>	 I found this comment while farming bullseye upgrade tickets: https://phabricator.wikimedia.org/T288804#7683776
[10:52:29] <btullis>	 I found this comment while farming bullseye upgrade tickets: https://phabricator.wikimedia.org/T288804#7683776 elukey: ^
[10:52:42] <btullis>	 elukey: ^^ sorry, fat fingers.
[10:53:22] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) Adding to the planning board for discussion.
[10:53:41] <elukey>	 btullis: thanks a lot! will report the finding <3
[11:09:50] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines (Sprint 07), 10Product-Analytics (Kanban): Include EU Registered Country in the canonical country database - https://phabricator.wikimedia.org/T324995 (10EChetty) 05Open→03Resolved Thanks @nshahquinn-wmf - looks good
[11:10:31] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines (Sprint 07), 10Patch-For-Review: Update sqoop for CheckUser table - https://phabricator.wikimedia.org/T326330 (10EChetty) 05Open→03Resolved
[11:10:39] <wikibugs>	 10Data-Engineering, 10CheckUser, 10MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (10EChetty)
[11:15:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:20:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:39] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jbond)
[11:33:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:37] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis)
[12:10:28] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis)
[12:17:54] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10herron)
[12:22:38] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines (Sprint 08): [Airflow] Build Druid Operator - https://phabricator.wikimedia.org/T309996 (10EChetty)
[12:25:31] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines (Sprint 08), 10Patch-For-Review, 10SecTeam-Processed, 10Vuln-VulnComponent: Upgrade Puppet code to make Airflow configuration files compatible with version 2.3.4 - https://phabricator.wikimedia.org/T315580 (10EChetty)
[12:26:32] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ssingh)
[12:28:47] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines: When moving oozie webrequest-load to airflow/spark avoid the error-check corner case - https://phabricator.wikimedia.org/T324757 (10EChetty)
[12:30:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:35:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:39:16] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Vgutierrez)
[12:41:02] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Joe) To depool all services in codfw we will just need to run:  ` sudo cookbook sre.discovery.datacenter-route --reason 'T327925'...
[12:43:53] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:46:09] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Joe) Please note: this won't depool `docker-registry`, which will still be active in codfw for the duration of the maintenance.
[13:00:39] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:41] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:55] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:15:25] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:48] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jcrespo)
[13:30:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:33] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:51] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) For the record, full row hosts downtime done with: `sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row A upgrade" -...
[13:34:18] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=295bf4d5-8856-488b-9ca9-06a0ff06db18) set by ayounsi@cumin1001 fo...
[13:44:15] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure, 10Event-Platform Value Stream (Sprint 08): Add dse k8s networks to puppet network constants - https://phabricator.wikimedia.org/T328447 (10JArguello-WMF) 05Open→03Resolved
[13:44:23] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 08): Flink docker image should work with pyflink - https://phabricator.wikimedia.org/T327494 (10JArguello-WMF) 05Open→03Resolved
[13:44:32] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 08): Deployment pipeline docker image of flink mediawiki stream enrichment python - https://phabricator.wikimedia.org/T326731 (10JArguello-WMF) 05Open→03Resolved
[13:44:35] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: Productionize PyFlink Enrichment Service - https://phabricator.wikimedia.org/T325303 (10JArguello-WMF)
[13:45:25] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:56:23] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:57:50] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10Ottomata) @elukey @achou as noted in https://phabricator.wikimedia.org/T301878#8008932, it would be better if new streams like this were...
[13:59:08] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: Productionize PyFlink Enrichment Service - https://phabricator.wikimedia.org/T325303 (10Ottomata)
[14:15:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:20:47] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:54] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10elukey) @Ottomata sure it shouldn't be a big problem, is there an ETA for the page_change stream to be live? (just to figure out how muc...
[14:28:45] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:30:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:19] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:13] <wikibugs>	 (03PS6) 10Aqu: Remove Guava from dependency [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[14:59:25] <wikibugs>	 10Data-Engineering-Planning, 10Data-Catalog, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (10EChetty)
[15:00:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:17] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 08): eventutilities-python should support nested row type info - https://phabricator.wikimedia.org/T327900 (10gmodena)
[15:10:57] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream: Remove hardcoded kafka parameters - https://phabricator.wikimedia.org/T329061 (10gmodena)
[15:14:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:14:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp2032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2032%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:15:43] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp2032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2032%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:19:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:22:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:04] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-07)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (10BTullis) 05Open→03Resolved I think that we should resolve this ticket and carry out the problem solving on {T325809} instead....
[15:24:22] <wikibugs>	 10Data-Engineering-Planning: Presto is unstable with more than 5 worker nodes - https://phabricator.wikimedia.org/T325809 (10BTullis) p:05Triage→03High
[15:28:52] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Clement_Goubert)
[15:30:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:34:59] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream: mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064 (10Ottomata)
[15:35:12] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream: mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064 (10Ottomata) p:05Triage→03Unbreak!
[15:39:27] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) 05Open→03Resolved a:03ayounsi The upgrade was smooth, ~15min hard downtime. No user impact, all the depools did the...
[15:46:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:59] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10colewhite)
[15:51:35] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Presto is unstable with more than 5 worker nodes - https://phabricator.wikimedia.org/T325809 (10BTullis) I'm bringing this ticket into the current #shared-data-infrastructure sprint. @Stevemunene and @nfraison and I will f...
[15:51:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:44] <wikibugs>	 10Analytics-Radar, 10Data-Engineering, 10Event-Platform Value Stream, 10SRE, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10colewhite)
[16:05:59] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream: Automated event stream throughput alerting for important state change streams - https://phabricator.wikimedia.org/T329070 (10Ottomata)
[16:06:59] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30), 10Patch-For-Review: mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064 (10Ottomata) Incident report drafting [[ https://docs.google.com/document/d/156gE_FD3qu67Mbumut-exlatRuFtib...
[16:15:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:52] <wikibugs>	 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, 10Infrastructure-Foundations, and 7 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ayounsi) p:05Triage→03Medium
[16:20:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:56] <wikibugs>	 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, 10Infrastructure-Foundations, and 7 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ayounsi)
[16:25:23] <wikibugs>	 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, 10Infrastructure-Foundations, and 7 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ayounsi)
[16:30:03] <wikibugs>	 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, 10Infrastructure-Foundations, and 8 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10colewhite)
[16:30:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:37] <wikibugs>	 (03CR) 10Aqu: Remove Guava from dependency (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu)
[16:47:50] <wikibugs>	 (03CR) 10Aqu: Remove Guava from dependency (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu)
[16:51:09] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream (Sprint 08), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10Ottomata) Annnnd we're done with schema!  Latest changes are now being produced to kafka jumbo in the r...
[16:58:57] <wikibugs>	 10Data-Engineering-Planning, 10Data-Catalog, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (10BTullis)
[17:00:52] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:06:02] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:06:29] <wikibugs>	 10Data-Engineering, 10Data-Catalog, 10Infrastructure-Foundations, 10CAS-SSO: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis)
[17:06:47] <wikibugs>	 10Data-Engineering, 10Data-Catalog, 10Infrastructure-Foundations, 10CAS-SSO: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10BTullis)
[17:10:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2035 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:11:38] <wikibugs>	 (03PS1) 10Snwachukwu: Update Webrequest table to include referer_data column. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/887371 (https://phabricator.wikimedia.org/T327074)
[17:15:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp2035 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:17:54] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:45:53] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:53] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:40] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Pageviews-Anomaly, 10Product-Analytics, and 6 others: Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage - https://phabricator.wikimedia.org/T328127 (10SNowick_WMF)
[18:00:45] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:04:53] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Pageviews-Anomaly, 10Wikipedia-iOS-App-Backlog, and 6 others: Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage - https://phabricator.wikimedia.org/T328127 (10SNowick_WMF)
[18:05:39] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:09:45] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:50:51] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:59:31] <mforns>	 milimetric: please go ahead with refinery-source deployment, I can change the code quickly, but I'll need some time to retest it and the corresponding Airflow DAG an'all...
[18:59:47] <mforns>	 I think I will do an extra deployment tomorrow before meetings
[19:16:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:22:16] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:53:37] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:15:15] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:20:27] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:25:45] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:45:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:48:45] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:05:03] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:26:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:31:40] <wikibugs>	 10Data-Engineering, 10Product-Analytics: 13 new wikis missing from mediawiki_history - https://phabricator.wikimedia.org/T329119 (10nshahquinn-wmf)
[21:55:03] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10Ottomata) live on all wikis: end of quarter if all goes well.  live with any reliability promises: TBD
[21:57:21] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10bking) Command `cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100 --cluster eqiad --group C --network a...
[22:28:02] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10bking) OK, the VM is responsive at console. SSH keys have not made it into our fingerprint server, so I can't lo...
[22:38:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2033%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[22:43:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2033%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[22:49:22] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10BTullis) @bking - I believe that you can run `wmf-update-known-hosts-production` (available via this package htt...
[23:14:27] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:37:19] <wikibugs>	 (03PS1) 10Krinkle: Remove elementtiming,firstinputtiming,layoutshift,resourcetiming,rumspeedindex [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/887425 (https://phabricator.wikimedia.org/T281103)