[00:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:18:21] (SystemdUnitFailed) firing: refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:50:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:18:21] (SystemdUnitFailed) firing: refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:18:21] (SystemdUnitFailed) firing: refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:50:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[12:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:18:21] (SystemdUnitFailed) firing: refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:50:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[16:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:18:21] (SystemdUnitFailed) firing: refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:26:18] team, looking at webrequest errors, there are duplicate map keys, trying to understand what that implies
[18:26:42] I think we can remove the problematic rows by adding them to the excluded_row_ids parameter
[18:40:06] hm, there are no duplicate (hostname, sequence) pairs in webrequest for the affected hours...
[18:40:14] I executed: select hostname, sequence, count(*) as freq from webrequest where year=2023 and month=11 and day=17 and hour=22 group by hostname, sequence order by freq desc limit 10;
[18:40:32] (in wmf_raw)
[18:41:12] not sure what the "map key" is...
[18:45:13] ah... maybe x_analytics header?
[18:51:48] ok, I think I understand the error message now, some x_analytics headers have duplicated 'pageview' keys...
[18:51:52] digging more
[19:04:26] yesssss:
[19:04:32] https://www.irccloud.com/pastebin/LPwGQ0z0/
[19:09:39] rerunning with excluded row
[19:31:16] (EventgateValidationErrors) firing: ...
[19:31:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[19:47:52] !log reran Airflow's refine_webrequest_hourly_text::refine_webrequest with excluded_row_ids for 2023-11-17T22
[19:47:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:50:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[20:03:22] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:05:21] hm, the first hour (2023-11-17T22) worked, but the second (2023-11-18T12), when I filter out the duplicate rows, fails with a similar message, indicating another duplicate key problem that I cannot reproduce with queries...
[20:31:16] (EventgateValidationErrors) resolved: ...
[20:31:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[20:43:15] (EventgateValidationErrors) firing: ...
[20:43:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[21:01:19] ok, the current query doesn't work for more than 1 corrupted row of the same hostname... because it tries to transform the passed (hostname, sequence) pairs into a map(), which does not support duplicate keys. So if 2 of the corrupted rows belong to the same hostname, it fails.
[21:01:39] I'll try to quick-fix the query and sync it to hdfs to fix prod
[21:08:16] (EventgateValidationErrors) resolved: ...
[21:08:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[21:18:21] (SystemdUnitFailed) firing: refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:57:26] !log reran Airflow's refine_webrequest_hourly_text::refine_webrequest with excluded_row_ids for 2023-11-18T12
[21:57:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:59:06] it worked, on Monday I will make sure that all docs are updated and stuff
[22:02:40] (PS1) Mforns: Quick fix to refine_webrequest_hourly for exclude_row_ids [analytics/refinery] - https://gerrit.wikimedia.org/r/975418
[22:03:11] Here's the quick fix change ^^^
[23:50:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
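
Note on the 21:01:19 message: the failure described there (building a map keyed by hostname from the excluded (hostname, sequence) pairs, which breaks as soon as two excluded rows share a hostname) can be illustrated with a small Python sketch. This is only an illustration under assumptions: the function names, the host names and the sequence numbers are hypothetical, the strict map is a stand-in for a map constructor that rejects duplicate keys, and grouping sequences per hostname is one possible way around the problem, not necessarily what the Gerrit change above does.

```python
from collections import defaultdict


def pairs_to_strict_map(pairs):
    """Build a hostname -> sequence map, rejecting duplicate hostnames.

    Hypothetical stand-in for a map constructor that raises on
    duplicate keys.
    """
    result = {}
    for hostname, sequence in pairs:
        if hostname in result:
            raise ValueError(f"duplicate map key: {hostname}")
        result[hostname] = sequence
    return result


def group_sequences_by_hostname(pairs):
    """Collect all excluded sequences per hostname, so repeated hostnames are fine."""
    grouped = defaultdict(set)
    for hostname, sequence in pairs:
        grouped[hostname].add(sequence)
    return dict(grouped)


def is_excluded(row, excluded):
    """Return True if the row's (hostname, sequence) pair is in the exclusion set."""
    return row["sequence"] in excluded.get(row["hostname"], set())


if __name__ == "__main__":
    # Hypothetical excluded rows: two of them share the same hostname.
    excluded_pairs = [("host-a", 101), ("host-a", 205), ("host-b", 7)]

    try:
        pairs_to_strict_map(excluded_pairs)
    except ValueError as err:
        print("strict map fails:", err)

    excluded = group_sequences_by_hostname(excluded_pairs)
    print(is_excluded({"hostname": "host-a", "sequence": 205}, excluded))
```

With one excluded row per hostname both approaches behave the same, which would be consistent with the first rerun (2023-11-17T22) succeeding and the second (2023-11-18T12) failing.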