[00:10:45] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:54:39] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:17:09] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:50:53] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:46:55] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:23:44] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 02), 10Spike: [SPIKE] Build simple stateless service using Flink SQL - https://phabricator.wikimedia.org/T318856 (10gmodena) A summary of this spike, and evaluation of the approach, can be found at https://www.mediawiki.org/wiki/Platform_Engineering_T... 
[07:24:05] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:44:36] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:37:28] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:39:01] 10Data-Engineering, 10API Platform, 10GraphQL, 10Pageviews-API: Responses on pageview API should be lighter - https://phabricator.wikimedia.org/T145935 (10EChetty) [08:40:36] 10Analytics-Radar, 10Data-Engineering, 10API Platform, 10Pageviews-API, 10Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (10EChetty) [09:09:26] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:20:12] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:22:03] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 02), 10Spike: [SPIKE] Build simple stateless service using Flink SQL - https://phabricator.wikimedia.org/T318856 (10gmodena) >>! 
In T318856#8314516, @Ottomata wrote: > This is SO COOL. (btw, no code in https://gitlab.wikimedia.org/gmodena/flink-media... [09:45:52] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:18:28] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:20:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:11] btullis: I'm having an issue with kerberos - have we done anything in that realm lately? [10:29:23] btullis: oh, and hi - my apologies [10:31:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:14] nevermind btullis - I managed to make it work - I had a corrupted ticket it seems - no idea why [10:40:56] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:20:20] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10SRE, 10serviceops, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Clement_Goubert) Just for confirmation before diving into it on Monday, the list of services to re-deploy is: ` deployment-c... [13:02:09] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10SRE, 10serviceops, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Ottomata) Correct! 
[13:03:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2035 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2035%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [13:05:25] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:08:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2035 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2035%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [13:45:14] 10Data-Engineering, 10Equity-Landscape: Editorship Output Rank Metrics - https://phabricator.wikimedia.org/T306618 (10KCVelaga_WMF) @JAnstee_WMF All the regional aggregations also now fully align. You can see the full comparison at https://docs.google.com/spreadsheets/d/1B9vZc8BI7zLrZhyM7XNAYQXN_XSMLyw5CJgBjxO... [13:45:30] 10Data-Engineering, 10Equity-Landscape: Readership Output Rank Metrics - https://phabricator.wikimedia.org/T306617 (10KCVelaga_WMF) @JAnstee_WMF All the regional aggregations also now fully align. You can see the full comparison at https://docs.google.com/spreadsheets/d/1B9vZc8BI7zLrZhyM7XNAYQXN_XSMLyw5CJgBjxO... 
[13:46:47] 10Data-Engineering, 10Equity-Landscape: Editorship Output Rank Metrics - https://phabricator.wikimedia.org/T306618 (10KCVelaga_WMF) @JAnstee_WMF All regional aggregations also align well now. You can see my comparisons at: https://docs.google.com/spreadsheets/d/1LhdxdVUCMXfmvK2xubnbvB7s4mRR5xEsnb2w1QUggmU/edit... [13:50:21] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:38:35] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 03), 10Platform Team Initiatives (Modern Event Platform (TEC2)): Allow disabling/enabling configured streams via wgEventStreams config - https://phabricator.wikimedia.org/T259712 (10lbowmaker) [14:39:01] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 03): Refactor EventBus extension Hooks to use new hook system - https://phabricator.wikimedia.org/T320655 (10lbowmaker) [15:32:45] (03PS7) 10Mforns: Fix end-of-month/year allowed_interval issue [analytics/refinery] - 10https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746) [15:50:27] (03CR) 10Mforns: [V: 03+2] "OK, I think this time it's good!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746) (owner: 10Mforns) [16:03:28] (03CR) 10Joal: Fix end-of-month/year allowed_interval issue (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746) (owner: 10Mforns) [16:03:39] mforns: couple of comments to help my understanding :) [16:03:46] joal: sure! 
[16:03:52] looking [16:16:09] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:18:44] hey mforns - would you give me a few minutes in da cave? [16:18:51] yessss joal [16:27:19] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:41:25] (03CR) 10Joal: Fix end-of-month/year allowed_interval issue (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746) (owner: 10Mforns) [16:44:55] (03PS8) 10Mforns: XFix end-of-month/year allowed_interval issue [analytics/refinery] - 10https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746) [16:55:55] Hi dcausse - your job is using 1/2 of the cluster RAM :) [16:56:20] dcausse: no big deal so far, but that's more than what we usually accept from users [16:56:29] I need to leave now, I'll recheck later on [17:45:53] 10Data-Engineering, 10Data-Persistence, 10Image-Suggestions: Section Level Image Suggestions - Data Persistence Request - https://phabricator.wikimedia.org/T320831 (10lbowmaker) [17:46:51] (03PS3) 10Milimetric: [WIP] Collaborate on a new editors dataset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/838256 [18:13:14] joal: sorry about that, will stop it [19:06:20] thanks dcausse :) [19:15:53] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:38:23] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:40:12] (VarnishkafkaNoMessages) firing: (4) varnishkafka on cp5015 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [19:45:12] (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp5015 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [20:34:25] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:19:25] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:31:11] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 02), 10Spike: [SPIKE] Build simple stateless service using PyFlink - https://phabricator.wikimedia.org/T318859 (10tchin) [[ https://gitlab.wikimedia.org/tchin/stateless-pyflink-examples/-/tree/main/ | Here's the with example datastream and table equiv... 
[22:37:59] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:34:05] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring