[04:16:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[04:26:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[05:49:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_07 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[06:09:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2022_07 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[08:11:34] Analytics-Radar, Data-Engineering-Radar, Event-Platform, Platform Engineering, and 8 others: eventlogging_VisualEditorTemplateDialogUse: '.event.template_names[0]' should be string - https://phabricator.wikimedia.org/T299779 (thiemowmde) Open→Resolved
[08:14:40] Analytics-Radar, Data-Engineering-Radar, Event-Platform, Platform Engineering, and 8 others: eventlogging_VisualEditorTemplateDialogUse: '.event.template_names[0]' should be string - https://phabricator.wikimedia.org/T299779 (thiemowmde)
[11:34:41] Hi, stat1008 seems really slow. I think this might be caused by 3 python scripts from shubhankar (18227, 18002, 17723). I reached out to him to check but he is not able to kill those scripts himself. Could you help with that?
[11:36:17] Hi mgerlach: I'll look into it now for you. I'm obviously hesitant to kill other people's tasks, but if we've exhausted all other options...
[11:39:07] Yes I see, it's definitely not a happy computer at the moment.
[11:40:24] btullis: understood. I don't think that shubhankar is in this channel but, if it helps, I am the point of contact as part of a formal collaboration in research and he asked me for help since he couldn't resolve https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations though happy to consider any other option to solve the issue
[11:40:32] thanks for looking into it
[11:42:41] thanks for the help btullis: stat1008 seems to run more smoothly again
[11:44:22] I didn't actually do anything, the kernel out-of-memory killer stepped in before I could carry out any action.
[11:44:35] https://www.irccloud.com/pastebin/j54cVMtc/
[11:46:03] Actually, those timestamps are from earlier. Perhaps 18002 just ran to completion.
[11:48:40] Definitely better anyway.
[11:48:42] https://usercontent.irccloud-cdn.com/file/6AuLNkFz/image.png
[12:37:14] (CR) Ottomata: [C: +1] "I believe the diff is big because we recently fixed a bug in jsonschema-tools where field ordering in JSON files was not consistent. Just" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/819043 (https://phabricator.wikimedia.org/T314151) (owner: Phuedx)
[12:52:59] (CR) Ottomata: Update ua-parser library (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/818083 (https://phabricator.wikimedia.org/T306829) (owner: Aqu)
[12:54:22] !log sudo systemctl reset-failed on stat1008 to remove failed debmonitor alerts
[12:54:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:54:55] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:03] (PS2) Phuedx: mediawiki/client/metrics_event: Make custom_data a map type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/819043 (https://phabricator.wikimedia.org/T314151)
[13:20:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:59] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:31:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:43] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:48:03] looking at the Druid segment alarm that fired this morning, it doesn't make sense because mw reduced 2022-07 won't be available for another couple days or so
[14:01:12] (VarnishkafkaNoMessages) firing: ...
[14:01:12] varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp2029:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:06:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:11:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:16:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:19:32] milimetric: It looks like it's already available according to this graph:
[14:19:33] https://usercontent.irccloud-cdn.com/file/LBCC88yB/image.png
[14:19:42] https://grafana.wikimedia.org/d/000000538/druid?orgId=1&refresh=1m&var-cluster=druid_public&var-datasource=eqiad%20prometheus%2Fanalytics&var-druid_datasource=mediawiki_history_reduced_2022_07&from=now-24h&to=now
[14:19:57] ...but I'm sure that you understand the pipeline better than I do.
[14:20:05] I may not...
[14:20:23] I'll check after my meeting
[14:21:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:36:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:37:47] Analytics-Radar, Discovery, Discovery-Analysis: UDF for language detection - https://phabricator.wikimedia.org/T182352 (RhinosF1) Declined→Open Re-opening tasks and removing from team workboard per IRC feedback given yesterday and discussion with MPham.
[14:46:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:51:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:56:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:57:10] Analytics-Radar, Discovery, Discovery-Analysis: UDF for language detection - https://phabricator.wikimedia.org/T182352 (mpopov) Open→Invalid
[15:01:12] (VarnishkafkaNoMessages) firing: (8) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:11:12] (VarnishkafkaNoMessages) firing: (5) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:16:12] (VarnishkafkaNoMessages) firing: (5) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:21:12] (VarnishkafkaNoMessages) firing: (5) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:38:06] PROBLEM - Host an-worker1082.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:12] (VarnishkafkaNoMessages) firing: (5) varnishkafka for instance cp2027:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:43:46] RECOVERY - Host an-worker1082.mgmt is UP: PING WARNING - Packet loss = 77%, RTA = 1.35 ms
[15:45:57] Data-Engineering: RAID battery alert in an-worker1082 - https://phabricator.wikimedia.org/T311991 (Cmjohnson)
[15:46:12] (VarnishkafkaNoMessages) firing: ...
[15:46:12] varnishkafka for instance cp2037:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp2037:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:51:12] (VarnishkafkaNoMessages) firing: (2) varnishkafka for instance cp2037:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:11:48] (CR) Milimetric: Add metric_id column to Wikidata EntitySchema text HQL (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/817837 (owner: Michael Große)
[16:13:52] I figured out what the weirdness was with mw history / druid segments! It's a lot faster this month 'cause of the refactored sqoop job
[16:18:58] (CR) Milimetric: [C: +2] Update ua-parser library (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/818083 (https://phabricator.wikimedia.org/T306829) (owner: Aqu)
[16:43:05] Data-Engineering: Standardize where Hive table creation scripts go - https://phabricator.wikimedia.org/T314415 (Milimetric)
[16:54:31] (CR) Phuedx: mediawiki/client/metrics_event: Make custom_data a map type (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/819043 (https://phabricator.wikimedia.org/T314151) (owner: Phuedx)
[16:54:47] Data-Engineering-Radar, Event-Platform, Generated Data Platform: Add Event Platform timestamp JSONSchema -> Flink type support - https://phabricator.wikimedia.org/T310495 (Ottomata)
[17:02:36] (VarnishkafkaNoMessages) firing: (3) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:02:52] Data-Engineering-Radar, Event-Platform, Generated Data Platform: Add Event Platform timestamp JSONSchema -> Flink type support - https://phabricator.wikimedia.org/T310495 (Ottomata) > However, it is possible, especially in client side submitted instrumentation event data, for date-times to come in wi...
[17:11:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:21:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:31:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:11:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:21:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:46:12] (VarnishkafkaNoMessages) firing: (4) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:41:34] (VarnishkafkaNoMessages) firing: (5) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:46:12] (VarnishkafkaNoMessages) firing: (5) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:51:12] (VarnishkafkaNoMessages) resolved: (5) varnishkafka for instance cp2029:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages