[00:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:38] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:19:16] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:06] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:04] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:31] Data-Engineering: Add Active editors by country for Wikidata to stats.wikimedia.org - https://phabricator.wikimedia.org/T328999 (Lectrician1) [01:02:08] Data-Engineering: Add Active editors by country for Wikidata to stats.wikimedia.org - https://phabricator.wikimedia.org/T328999 (Lectrician1) Related: T266643 T265510 [01:05:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:16] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:50:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:03:38] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:27:29] (PS1) Chad: Drop vestiges of git-fat [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/887000 (https://phabricator.wikimedia.org/T328473) [03:30:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:36:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:35:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:06] RECOVERY - 
Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:20:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:45:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:50:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:20:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:12] RECOVERY - 
Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:44] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi) Staging the new version on the switches: `asw-a-codfw> request system software add force-host set [ /var/tmp/jinstall-ex-... 
[08:45:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:49] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:53:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [08:54:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2028%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [08:58:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [08:58:42] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2027 is not sending enough cache_text requests - 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [08:58:42] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [08:59:12] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:03:42] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:03:42] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:04:12] (VarnishkafkaNoMessages) firing: (4) varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:09:12] (VarnishkafkaNoMessages) firing: (6) varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:09:57] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp5019 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:11:42] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2031 is not sending enough 
cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:13:42] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:14:12] (VarnishkafkaNoMessages) resolved: (6) varnishkafka on cp2032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:15:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:57] (VarnishkafkaNoMessages) firing: (4) varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:16:42] (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:17:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp6009 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6009%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:18:43] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 
(gmodena) I archived the [mediawiki-stream-enrichment](https://gitlab.wikimedia.org/repos/data-e... [09:19:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp6010 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6010%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:19:57] (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:20:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:57] (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:27:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp6012 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6012%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:27:41] (VarnishkafkaNoMessages) firing: (6) varnishkafka on cp2041 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:31:41] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp6002 is not sending 
enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:31:41] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp6002 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:31:56] (VarnishkafkaNoMessages) resolved: (6) varnishkafka on cp2041 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:32:41] (VarnishkafkaNoMessages) firing: (6) varnishkafka on cp5022 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:34:58] Quarry: GoogleDocs bot has download 125 000 csv exports in the last month - https://phabricator.wikimedia.org/T197256 (taavi) [09:35:56] (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp5031 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:36:41] (VarnishkafkaNoMessages) firing: (5) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:37:41] (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp3050 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:40:41] Quarry, cloud-services-team (FY2022/2023-Q3): Consider moving Quarry to be an installation of Redash - https://phabricator.wikimedia.org/T169452 (taavi) 
[09:41:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp3054 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp3054%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:41:41] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:41:42] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:41:42] Data-Engineering, Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (KCVelaga_WMF) @JAnstee_WMF I realized that we are already considering growth, however, the column title is slightly confusing calculation for growth in SQL query ` connectivity_index / lag(connectivit... [09:41:56] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:42:56] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (elukey) Created some docs to implement and test the new stream in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Streams_... 
[09:42:57] (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp3050 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:45:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:56] (VarnishkafkaNoMessages) firing: (5) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:47:42] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp1077 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:47:42] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp1077 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:47:42] (VarnishkafkaNoMessages) resolved: varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3051%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:47:56] (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:47:57] (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp1075 is not sending enough cache_text requests - 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:48:42] (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp3055 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:49:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:56] (VarnishkafkaNoMessages) firing: (7) varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:53:42] (VarnishkafkaNoMessages) firing: (8) varnishkafka on cp1084 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:56:00] hi folks! 
[09:56:13] Hi elukey :-) [09:56:22] filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/887285 to install a couple of packages to stat100x boxes (should be hopefully for a limited amount of time) [09:56:56] (VarnishkafkaNoMessages) resolved: (7) varnishkafka on cp1084 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:57:42] (VarnishkafkaNoMessages) firing: (8) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [09:59:42] (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp1084 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [10:00:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:48] elukey: Seems fine to me, but is it worth checking this? From here: https://packages.debian.org/bullseye/ocl-icd-libopencl1 [10:00:48] > This package contains an installable client driver loader (ICD Loader) library that can be used to load any (free or non-free) installable client driver (ICD) for OpenCL. [10:01:01] Will we be using any non-free ICDs? 
[10:01:56] (VarnishkafkaNoMessages) firing: (8) varnishkafka on cp1084 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [10:02:42] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp3060 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [10:02:56] (VarnishkafkaNoMessages) firing: (9) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [10:03:42] (VarnishkafkaNoMessages) resolved: varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1081%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [10:04:37] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:04:42] (VarnishkafkaNoMessages) resolved: (7) varnishkafka on cp1084 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [10:05:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:56] (VarnishkafkaNoMessages) firing: (8) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [10:08:42] (VarnishkafkaNoMessages) resolved: (7) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [10:15:21] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:17:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3051%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [10:19:24] btullis: sorry just seen the ping, not that I know [10:20:04] but I'll ask to the content translation team to check, thanks for the reference :) [10:21:56] (VarnishkafkaNoMessages) resolved: varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3051%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [10:24:47] Analytics-Radar, Data-Services: Discuss labsdb visibility of rev_text_id and ar_comment - https://phabricator.wikimedia.org/T158166 (taavi) [10:29:27] btullis: ah snap for the test the team just realized that we'd need py3.9, any plans to upgrade 
the stat100[5,8] nodes any time soon? [10:38:05] (to bullseye I meant) [10:41:39] Yes, definite plans. No firm dates yet. stat1010 is installed with bullseye, but still `insetup::data_engineering` - Maybe I can push through a change to put this into service quickly? [10:42:14] btullis: stat1009 is also on bullseye so it is ok for cpu-only tests, but it doesn't have the GPU :( [10:42:28] this is why I was asking for 1008/1005 [10:43:03] Oh yeah, sorry. Forgot. OK, time to prioritise the upgrade then, I suppose. We have planning later today and we have 3 SREs on the team now :-) [10:43:28] \o/ [10:43:29] thanks a lot [10:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:29] elukey: Will `conda install python=3.9?` work for this requirement in the short term? [10:47:33] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:48:55] btullis: ah wait I didn't think about it, super ignorant about conda.. is it so magical? [10:50:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:22] I found this comment while farming bullseye upgrade tickets: https://phabricator.wikimedia.org/T288804#7683776 [10:52:29] I found this comment while farming bullseye upgrade tickets: https://phabricator.wikimedia.org/T288804#7683776elukey: ^ [10:52:42] elukey: ^^ sorry, fat fingers. 
[10:53:22] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) Adding to the planning board for discussion. [10:53:41] btullis: thanks a lot! will report the finding <3 [11:09:50] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 07), 10Product-Analytics (Kanban): Include EU Registered Country in the canonical country database - https://phabricator.wikimedia.org/T324995 (10EChetty) 05Open→03Resolved Thanks @nshahquinn-wmf - looks good [11:10:31] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 07), 10Patch-For-Review: Update sqoop for CheckUser table - https://phabricator.wikimedia.org/T326330 (10EChetty) 05Open→03Resolved [11:10:39] 10Data-Engineering, 10CheckUser, 10MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (10EChetty) [11:15:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:39] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jbond) [11:33:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:37] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) [12:10:28] 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) [12:17:54] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10herron) [12:22:38] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 08): [Airflow] Build Druid Operator - https://phabricator.wikimedia.org/T309996 (10EChetty) [12:25:31] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 08), 10Patch-For-Review, 10SecTeam-Processed, 10Vuln-VulnComponent: Upgrade Puppet code to make Airflow configuration files compatible with version 2.3.4 - https://phabricator.wikimedia.org/T315580 (10EChetty) [12:26:32] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ssingh) [12:28:47] 10Data-Engineering-Planning, 10Data Pipelines: When moving oozie webrequest-load to airflow/spark avoid the error-check corner case - https://phabricator.wikimedia.org/T324757 (10EChetty) [12:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:16] 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 12 others: codfw row A switches 
upgrade - https://phabricator.wikimedia.org/T327925 (Vgutierrez) [12:41:02] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Joe) To depool all services in codfw we will just need to run: ` sudo cookbook sre.discovery.datacenter-route --reason 'T327925'... [12:43:53] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:46:09] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Joe) Please note: this won't depool `docker-registry`, which will still be active in codfw for the duration of the maintenance. [13:00:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:55] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:15:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:48] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (jcrespo) [13:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:51] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi) For the record, full row hosts downtime done with: `sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row A upgrade" -... [13:34:18] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=295bf4d5-8856-488b-9ca9-06a0ff06db18) set by ayounsi@cumin1001 fo...
[13:44:15] Data-Engineering-Planning, Shared-Data-Infrastructure, Event-Platform Value Stream (Sprint 08): Add dse k8s networks to puppet network constants - https://phabricator.wikimedia.org/T328447 (JArguello-WMF) Open→Resolved [13:44:23] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Flink docker image should work with pyflink - https://phabricator.wikimedia.org/T327494 (JArguello-WMF) Open→Resolved [13:44:32] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Deployment pipeline docker image of flink mediawiki stream enrichment pyhon - https://phabricator.wikimedia.org/T326731 (JArguello-WMF) Open→Resolved [13:44:35] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Productionize PyFlink Enrichment Service - https://phabricator.wikimedia.org/T325303 (JArguello-WMF) [13:45:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:23] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:57:50] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Ottomata) @elukey @achou as noted in https://phabricator.wikimedia.org/T301878#8008932, it would be better if new streams like this were...
[13:59:08] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Productionize PyFlink Enrichment Service - https://phabricator.wikimedia.org/T325303 (Ottomata) [14:15:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:54] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (elukey) @Ottomata sure it shouldn't be a big problem, is there an ETA for the page_change stream to be live? (just to figure out how muc... [14:28:45] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed:
produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:13] (PS6) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) [14:59:25] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (EChetty) [15:00:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:17] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): eventutilities-python should support nested row type info - https://phabricator.wikimedia.org/T327900 (gmodena) [15:10:57] Data-Engineering, Event-Platform Value Stream: Remove hardcoded kafka parameters - https://phabricator.wikimedia.org/T329061 (gmodena) [15:14:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:14:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2032%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:15:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2032%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:19:12] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [15:22:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:04] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-07)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (BTullis) Open→Resolved I think that we should resolve this ticket and carry out the problem solving on {T325809} instead....
[15:24:22] Data-Engineering-Planning: Presto is unstable with more than 5 worker nodes - https://phabricator.wikimedia.org/T325809 (BTullis) p:Triage→High [15:28:52] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Clement_Goubert) [15:30:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:59] Data-Engineering, Event-Platform Value Stream: mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064 (Ottomata) [15:35:12] Data-Engineering, Event-Platform Value Stream: mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064 (Ottomata) p:Triage→Unbreak! [15:39:27] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi) Open→Resolved a:ayounsi The upgrade was smooth, ~15min hard downtime. No user impact, all the depools did the...
[15:46:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:59] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (colewhite) [15:51:35] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Presto is unstable with more than 5 worker nodes - https://phabricator.wikimedia.org/T325809 (BTullis) I'm bringing this ticket into the current #shared-data-infrastructure sprint. @Stevemunene and @nfraison and I will f... [15:51:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:44] Analytics-Radar, Data-Engineering, Event-Platform Value Stream, SRE, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (colewhite) [16:05:59] Data-Engineering, Event-Platform Value Stream: Automated event stream throughput alerting for important state change streams - https://phabricator.wikimedia.org/T329070 (Ottomata) [16:06:59] Data-Engineering, Event-Platform Value Stream, MW-1.40-notes (1.40.0-wmf.21; 2023-01-30), Patch-For-Review: mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064 (Ottomata) Incident report drafting [[ https://docs.google.com/document/d/156gE_FD3qu67Mbumut-exlatRuFtib...
[16:15:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:52] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ayounsi) p:Triage→Medium [16:20:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:56] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ayounsi) [16:25:23] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ayounsi) [16:30:03] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 8 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (colewhite) [16:30:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:37] (CR) Aqu: Remove Guava from dependency (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu) [16:47:50] (CR) Aqu: Remove Guava from dependency (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118
(https://phabricator.wikimedia.org/T327072) (owner: Aqu) [16:51:09] Data-Engineering, Event-Platform Value Stream (Sprint 08), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) Annnnd we're done with schema! Latest changes are now being produced to kafka jumbo in the r... [16:58:57] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (BTullis) [17:00:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:29] Data-Engineering, Data-Catalog, Infrastructure-Foundations, CAS-SSO: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) [17:06:47] Data-Engineering, Data-Catalog, Infrastructure-Foundations, CAS-SSO: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (BTullis) [17:10:12] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2035 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [17:11:38] (PS1) Snwachukwu: Update Webrequest table to include referer_data column.
[analytics/refinery] - https://gerrit.wikimedia.org/r/887371 (https://phabricator.wikimedia.org/T327074) [17:15:12] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp2035 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [17:17:54] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:45:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:40] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Product-Analytics, and 6 others: Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage - https://phabricator.wikimedia.org/T328127 (SNowick_WMF) [18:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:53] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Wikipedia-iOS-App-Backlog, and 6 others: Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage - https://phabricator.wikimedia.org/T328127 (SNowick_WMF) [18:05:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:45] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy
WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:50:51] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:59:31] milimetric: please go ahead with refinery-source deployment, I can change the code quickly, but I'll need some time to retest it and the corresponding Airflow DAG an'all... [18:59:47] I think I will do an extra deployment tomorrow before meetings [19:16:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:16] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:53:37] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:15:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:45] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:26:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:40] Data-Engineering, Product-Analytics: 13 new wikis missing from mediawiki_history - https://phabricator.wikimedia.org/T329119 (nshahquinn-wmf) [21:55:03] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Ottomata) live on all wikis: end of quarter if all goes well.
live with any reliability promises: TBD [21:57:21] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (bking) Command `cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100 --cluster eqiad --group C --network a... [22:28:02] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (bking) OK, the VM is responsive at console. SSH keys have not made it into our fingerprint server, so I can't lo... [22:38:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2033%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:43:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2033%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [22:49:22] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (BTullis) @bking - I believe that you can run `wmf-update-known-hosts-production` (available via this package htt...
[23:14:27] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:37:19] (PS1) Krinkle: Remove elementtiming,firstinputtiming,layoutshift,resourcetiming,rumspeedindex [schemas/event/secondary] - https://gerrit.wikimedia.org/r/887425 (https://phabricator.wikimedia.org/T281103)