[00:18:07] PROBLEM - Check unit status of refinery-drop-eventlogging-legacy-raw-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-eventlogging-legacy-raw-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:20:11] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-drop-eventlogging-legacy-raw-partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:55] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:16] PROBLEM - Check unit status of refinery-drop-webrequest-refined-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-webrequest-refined-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:53:26] PROBLEM - Check unit status of drop-features-actor-rollup-hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-features-actor-rollup-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:55:08] PROBLEM - Check unit status of refinery-drop-pageview-actor-hourly-partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-pageview-actor-hourly-partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:56:02] PROBLEM - Check unit status of drop-predictions-actor_label-hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-predictions-actor_label-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:08:33] PROBLEM - Check unit status of drop-features-actor-hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-features-actor-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:24:07] PROBLEM - Check unit status of drop_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:30:37] PROBLEM - Check unit status of refinery-drop-raw-netflow-event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-raw-netflow-event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:09:25] PROBLEM - Check unit status of refinery-drop-banner-activity on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-drop-banner-activity https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:20:57] PROBLEM - SSH on analytics1075.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:22:13] RECOVERY - SSH on analytics1075.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:38:55] PROBLEM - Check systemd state on an-worker1122 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:48] RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1081%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:37:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1081%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:37:28] !log Rerun airflow unique_devices_dailyschedule: @daily
[06:37:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:38:03] !log Try to rerun airflow unique_devices_daily.compute_per_project_family_metrics.2022-09-15
[06:38:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:54:59] (PS5) Awight: Maps interaction event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972)
[06:55:14] (CR) Awight: Maps interaction event schema (8 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972) (owner: Awight)
[06:58:15] Remind me, is it discouraged to do fancy jsonschema stuff, like using "oneOf" to only require a field when another field has a certain enum value, for example?
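For reference, the conditional-requirement pattern asked about above could look roughly like the sketch below. This is only an illustration, not the schema under review: the field names `action` and `marker_count` and the enum values are hypothetical, and the small `validate` helper is an assumption written for this example. It handles only the few JSON Schema keywords used here (`type`, `enum`, `not`, `required`, `properties`, `oneOf`) so that the snippet runs without third-party packages.

```python
# Hypothetical schema: "marker_count" is required only when
# "action" has the enum value "show-nearby".
SCHEMA = {
    "type": "object",
    "required": ["action"],
    "properties": {"action": {"enum": ["open", "close", "show-nearby"]}},
    "oneOf": [
        # Branch 1: action is "show-nearby", so marker_count must be present.
        {
            "properties": {"action": {"enum": ["show-nearby"]}},
            "required": ["marker_count"],
        },
        # Branch 2: action is anything else; no extra requirement.
        {
            "properties": {"action": {"not": {"enum": ["show-nearby"]}}},
        },
    ],
}


def validate(instance, schema):
    """Return True if instance satisfies schema.

    Toy evaluator for this example only; it is not a full JSON Schema
    implementation and ignores every keyword not handled below.
    """
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return False
    if "enum" in schema and instance not in schema["enum"]:
        return False
    if "not" in schema and validate(instance, schema["not"]):
        return False
    if isinstance(instance, dict):
        if any(key not in instance for key in schema.get("required", [])):
            return False
        for key, subschema in schema.get("properties", {}).items():
            if key in instance and not validate(instance[key], subschema):
                return False
    if "oneOf" in schema:
        # oneOf passes only when exactly one branch matches.
        matches = sum(validate(instance, branch) for branch in schema["oneOf"])
        if matches != 1:
            return False
    return True


if __name__ == "__main__":
    print(validate({"action": "open"}, SCHEMA))
    print(validate({"action": "show-nearby"}, SCHEMA))
    print(validate({"action": "show-nearby", "marker_count": 3}, SCHEMA))
```

The two `oneOf` branches are mutually exclusive on `action`, which is what makes the "exactly one" rule behave like a conditional. JSON Schema draft-07 and later also offer `if`/`then`, which expresses the same rule more directly.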
[08:23:19] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics-Platform: Allow JavaScript errors to fail CI builds - https://phabricator.wikimedia.org/T318902 (kostajh)
[09:28:07] PROBLEM - SSH on analytics1075.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:39:12] (PS1) Thiemo Kreuz (WMDE): Limit HTTP status code to 0…599 [schemas/event/primary] - https://gerrit.wikimedia.org/r/836735
[09:39:45] (CR) CI reject: [V: -1] Limit HTTP status code to 0…599 [schemas/event/primary] - https://gerrit.wikimedia.org/r/836735 (owner: Thiemo Kreuz (WMDE))
[09:41:14] (PS1) Thiemo Kreuz (WMDE): Limit HTTP status code to 0…599 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736
[09:41:42] (CR) CI reject: [V: -1] Limit HTTP status code to 0…599 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736 (owner: Thiemo Kreuz (WMDE))
[09:42:17] (CR) Thiemo Kreuz (WMDE): [C: +1] Maps interaction event schema (3 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972) (owner: Awight)
[09:44:06] Analytics-Clusters, Data-Engineering, Infrastructure-Foundations, SRE, User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (MoritzMuehlenhoff) a:MoritzMuehlenhoff I'll be working on this as part of a larger effo...
[09:44:11] Analytics-Clusters, Data-Engineering, Infrastructure-Foundations, SRE, User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (MoritzMuehlenhoff)
[09:54:03] Analytics-Clusters, Data-Engineering-Radar, Infrastructure-Foundations, SRE, User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (BTullis) >>! In T258700#8271831, @MoritzMuehlenhoff wrote: > I'll be working on this...
[10:29:17] RECOVERY - SSH on analytics1075.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:36:28] (PS6) Awight: Maps interaction event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972)
[10:36:30] (CR) Awight: Maps interaction event schema (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972) (owner: Awight)
[10:43:03] (PS7) Awight: Maps interaction event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972)
[10:44:15] (CR) Awight: "PS 7: Split out the marker paint event so that we don't lose show-nearby events when marker paint was unsuccessful, and to simplify callba" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972) (owner: Awight)
[10:54:07] (CR) Awight: "Run this command after modifying the schema and manually incrementing the version number:" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736 (owner: Thiemo Kreuz (WMDE))
[10:55:07] (CR) Awight: Limit HTTP status code to 0…599 (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736 (owner: Thiemo Kreuz (WMDE))
[12:01:05] (PS8) Awight: Maps interaction event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972)
[12:22:21] Data-Engineering, Data Pipelines (Sprint 02): Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (EChetty)
[12:22:33] Data-Engineering, Data Pipelines (Sprint 02): Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (EChetty)
[12:34:30] !log Rerun failed oozie webrequest-load-wf-text-2022-9-29-9
[12:34:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:35:47] (PS1) Joal: Fix unique-devices per project-family HQL [analytics/refinery] - https://gerrit.wikimedia.org/r/836813 (https://phabricator.wikimedia.org/T305841)
[15:35:12] (VarnishkafkaNoMessages) firing: (3) varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:40:12] (VarnishkafkaNoMessages) resolved: (3) varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:41:42] (VarnishkafkaNoMessages) firing: (4) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:44:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:46:42] (VarnishkafkaNoMessages) resolved: (4) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:49:57] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp1085 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:53:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:54:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:54:42] (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp1085 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:56:58] hi btullis - there have been quite some varnishkafka errors today - would you be aware of any prod change happening that could lead to those?
[15:58:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:01:29] joal: No, I'm not aware of anything that would cause these. Investigating.
[16:01:37] thanks a lot btullis
[16:02:03] see https://grafana.wikimedia.org/d/pr6ZUm5nz/haproxy-cluster-view?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=text&from=now-1h&to=now
[16:02:11] there was an attack that matches up with this period
[16:03:12] Thanks sukhe - I'll investigate why this triggers the alert.
[16:03:20] thanks btullis
[16:04:32] thanks sukhe
[16:11:57] xcollazo: asking here :) What about using Row extension function to implement getOptX ?
[16:12:41] joal: don't know about that feature. You have a link?
[16:12:53] xcollazo: let's batcave if you wish for a minute
[16:12:58] yes
[16:16:05] Data-Engineering, Product-Analytics, wmfdata-python, Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (EChetty)
[16:16:38] Data-Engineering, Product-Analytics, wmfdata-python, Data Pipelines (Sprint 02): Upgrade WMFData Python Package to use Spark3 - https://phabricator.wikimedia.org/T318587 (EChetty)
[17:15:58] (CR) Joal: [C: -1] Fix end-of-month/year allowed_interval issue (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/836295 (https://phabricator.wikimedia.org/T316746) (owner: Mforns)
[17:32:26] (CR) Sbisson: [C: +2] Add Wikistories contribution_attempt_id [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836266 (https://phabricator.wikimedia.org/T317934) (owner: Neil P. Quinn-WMF)
[17:33:21] (Merged) jenkins-bot: Add Wikistories contribution_attempt_id [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836266 (https://phabricator.wikimedia.org/T317934) (owner: Neil P. Quinn-WMF)
[17:49:38] (CR) Thiemo Kreuz (WMDE): [C: +1] Maps interaction event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/835621 (https://phabricator.wikimedia.org/T315972) (owner: Awight)
[18:09:44] Analytics, API Platform, Platform Engineering Roadmap, User-Eevans: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (apaskulin)
[19:59:54] Data-Engineering, Equity-Landscape: Editorship Input Metrics - https://phabricator.wikimedia.org/T309274 (JAnstee_WMF) @ntsako did you see my earlier note? If I could get your code for the extract here like the reader inputs that would be super! Also, there was an inversion correction needed for the Yo...
[20:26:36] Analytics-Radar, Domains, SRE, Traffic-Icebox, WMF-General-or-Unknown: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (Krinkle) >>! In T262996#8002643, @Nemo_bis wrote: > Is this related to "T...
[20:26:54] Analytics-Radar, Domains, SRE, Traffic-Icebox, and 2 others: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (Krinkle)
[20:31:24] Data-Engineering, Equity-Landscape: Readership input metrics - https://phabricator.wikimedia.org/T309273 (JAnstee_WMF) @ntsako Initial review complete see comparison at: https://docs.google.com/spreadsheets/d/1QMkbIb1Buv1tpTdvc4bnQOD5jI9xbGmg4nqnwjCT15M/edit#gid=160810869 while numbers don't align for...
[20:33:38] Data-Engineering, Equity-Landscape: Readership input metrics - https://phabricator.wikimedia.org/T309273 (JAnstee_WMF) a:JAnstee_WMF→ntsako
[20:39:51] Data-Engineering, Equity-Landscape: Editorship Input Metrics - https://phabricator.wikimedia.org/T309274 (JAnstee_WMF) a:JAnstee_WMF→ntsako
[20:40:02] RECOVERY - Check unit status of drop-features-actor-hourly on an-launcher1002 is OK: OK: Status of the systemd unit drop-features-actor-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:53:32] PROBLEM - Check unit status of drop-features-actor-hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-features-actor-hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:11:13] Data-Engineering, Equity-Landscape: Readership Output Rank Metrics - https://phabricator.wikimedia.org/T306617 (JAnstee_WMF) @KCVelaga Initial pass complete - See comparisons here: https://docs.google.com/spreadsheets/d/1KabvexssiW5CWGZzbafgvrCfw0qlXzeNm3TiyUePiEo/edit#gid=2059162494 Notes: * Fairly cl...