[00:39:13] (VarnishkafkaNoMessages) firing: varnishkafka on cp3055 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3055%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[00:44:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp3055 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3055%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:25:46] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.188 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[02:46:54] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.1802 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[07:29:31] Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (gmodena) Thanks for this @Ottomata. From the point of view of a k8s and helm operator, are there best practices we should follow...
[10:51:56] !log Deploying refinery for ops week
[10:51:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:55:06] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): eventutilities-python source and destination stream must be versioned - https://phabricator.wikimedia.org/T327866 (gmodena)
[11:04:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:19] Hi joal, we got this during the refinery-deploy-to-hdfs step of the refinery deploy: https://phabricator.wikimedia.org/P43526 Any ideas?
[11:21:54] steve_munene: joal: I suspect that we should just run that step again, but it's a bit troubling. One `dfs -put` operation and one `dfs -cp` operation failed.
[11:23:55] Also perhaps the final swap of `/wmf/refinery/current` (https://phabricator.wikimedia.org/P43526$29) shouldn't run if any of the previous steps exited with an error. Have we seen this before?
[11:26:18] Does this mean an immediate stop once there is an error?
[11:30:33] Well, that's what I'm wondering. We already have `-e` set in the bash script `refinery-deploy-to-hdfs` here: https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/bin/refinery-deploy-to-hdfs#16
[11:31:46] ...but we saw errors from two steps and the script continued running.
[11:34:18] steve_munene: I suggest running the `refinery-deploy-to-hdfs` step of the deployment again.
[11:35:30] alright, going ahead with the rerun
[11:39:41] got the same error
[11:41:03] Interesting. Exactly the same two errors?
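Editorial note: the guard suggested at 11:23:55 (do not swap `/wmf/refinery/current` if any earlier step failed) can be sketched as follows. This is a minimal illustration in Python, not the actual `refinery-deploy-to-hdfs` bash script; the step names and the stand-in commands are hypothetical, and the log does not establish why the real script kept running despite `-e`.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

# A step is (name, argv). Names and commands are hypothetical stand-ins
# for the real `dfs -put` / `dfs -cp` calls, which are not reproduced here.
Step = Tuple[str, Sequence[str]]


def run_steps(steps: Sequence[Step]) -> List[str]:
    """Run every step in order; return the names of steps that exited non-zero."""
    failed = []
    for name, argv in steps:
        result = subprocess.run(argv)
        if result.returncode != 0:
            failed.append(name)
    return failed


def deploy(steps: Sequence[Step], swap_step: Step) -> bool:
    """Run the upload steps, then the final swap only if none of them failed."""
    failed = run_steps(steps)
    if failed:
        print("skipping swap, failed steps: " + ", ".join(failed), file=sys.stderr)
        return False
    return subprocess.run(swap_step[1]).returncode == 0


if __name__ == "__main__":
    succeed = [sys.executable, "-c", "pass"]
    fail = [sys.executable, "-c", "import sys; sys.exit(1)"]
    # The swap is skipped because the "cp" step exits with status 1.
    print(deploy([("put", succeed), ("cp", fail)], ("swap", succeed)))
```

This variant runs all steps and reports every failure before refusing the swap. Returning from `run_steps` on the first failure would instead give the immediate stop asked about at 11:26:18.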
[11:43:01] more errors this time, btullis, joal: https://phabricator.wikimedia.org/P43527
[11:44:03] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] Fix eventutilities-python linting - https://phabricator.wikimedia.org/T328547 (gmodena)
[11:44:37] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] Fix eventutilities-python linting - https://phabricator.wikimedia.org/T328547 (gmodena)
[12:01:37] Data-Engineering-Planning: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (BTullis) @kzimmerman - I have added the sql_lab role to all of those users' accounts in Superset. @JAnstee_WMF - I have added the sql_lab role to all of those users' accounts in...
[12:23:11] Data-Engineering-Planning: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (BTullis) I think it also makes sense to add: * all members of the DE team as //sql_lab// users * all SREs as //Admin// users @odimitrijevic - do you have thoughts on a policy...
[12:26:37] steve_munene: It might be that we just hit a particularly busy time for the Hadoop cluster, given that it's the first of the month. Check out this graph. https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=112
[12:27:29] Maybe if we keep an eye on it and retry this step when it is in a lull, it will succeed.
[12:33:26] It would be nice to add some labels to the Y-axis of those Hadoop charts. It's not easy to tell from that what the actual data rate is. It looks to me like it's a cumulative amount of data written per 5 minutes, which isn't very intuitive.
[12:34:14] https://usercontent.irccloud-cdn.com/file/OxO4bCtr/image.png
[12:42:08] cool, we can try at a later time
[12:43:27] How about now? That little peak at 12:24 ish seems to have dropped off, so the cluster looks quiet at the moment.
[12:54:35] Hi folks - Indeed, the 1st of the month is a busy time for the cluster - The errors you see seem to be due to too much load, either on the cluster itself or on an-launcher1002
[12:56:19] It could be worth a try to execute the script from an-coord1001
[12:56:25] I think it has the right keytabs
[13:33:55] Thanks joal: we could try that. btullis: I missed the window you had mentioned
[13:45:20] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-07)): Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (EChetty)
[13:48:59] Hi milimetric - I know it's early for you - let me know when you have a minute to talk about the webrequest_actor PR
[13:49:57] I'll be back from dropoff in 45 min joal
[13:50:19] ack milimetric - Let's see if it fits my kids' siesta :)
[14:02:45] Data-Engineering-Planning, Shared-Data-Infrastructure, Superset: Error when displaying box plots in Superset - https://phabricator.wikimedia.org/T328047 (BTullis) Hi @Pablo - Thanks so much for reporting this. It definitely seems like a regression, compared with the previous version. I will try to e...
[14:09:31] Data-Engineering-Planning, Shared-Data-Infrastructure, Superset: Error when displaying box plots in Superset - https://phabricator.wikimedia.org/T328047 (Pablo) Thanks @BTullis!
[14:10:13] Data-Engineering-Planning, Shared-Data-Infrastructure, Superset: Error when displaying box plots in Superset - https://phabricator.wikimedia.org/T328047 (BTullis) The [[https://gerrit.wikimedia.org/r/c/analytics/superset/deploy/+/865609|latest build]] of Superset included version 1.24 of numpy: `art...
[14:30:50] ok joal, I'm back, lemme know and I'll jump in the cave
[14:32:20] Data-Engineering, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (Ottomata) >> In refinery-source, we have a Scala ConfigHelper that does just this. > Would it make sense to provide a Python wrappe...
[14:37:49] Hey milimetric - batcave now?
[14:39:29] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): eventutilities-python should support nested row type info - https://phabricator.wikimedia.org/T327900 (gmodena) a:gmodena
[14:41:03] omw
[14:42:41] ottomata: o/ hiii IIRC when new streams are merged and deployed in mediawiki-config then a roll restart of the eventgate main pod is needed, right?
[14:44:38] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] Fix eventutilities-python linting - https://phabricator.wikimedia.org/T328547 (Ottomata) For eventutilities-python, let's just do that flake8 and mypy stuff you got there asap!
[14:49:44] Data-Engineering-Planning: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (Ottomata) > all members of the DE team as sql_lab users Maybe make all members of the DE team as Admin users?
[14:52:04] Data-Engineering, Event-Platform Value Stream, Epic: Flink Operations - https://phabricator.wikimedia.org/T328561 (lbowmaker)
[14:56:51] Data-Engineering, Event-Platform Value Stream: How to handle restarting a Flink application - https://phabricator.wikimedia.org/T328563 (lbowmaker)
[14:57:17] Data-Engineering, Event-Platform Value Stream: [Flink Operations] How to handle restarting a Flink application - https://phabricator.wikimedia.org/T328563 (lbowmaker)
[15:00:30] Data-Engineering, Event-Platform Value Stream: [Flink Operations] Automate Replay of Failed Events - https://phabricator.wikimedia.org/T328565 (lbowmaker)
[15:10:01] Data-Engineering, Event-Platform Value Stream: [Flink Operations] Automate Replay of Failed Events - https://phabricator.wikimedia.org/T328565 (lbowmaker) a:lbowmaker→None
[15:12:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:38] Data-Engineering, Event-Platform Value Stream: [Flink Operation] How to handle app upgrades - https://phabricator.wikimedia.org/T328569 (lbowmaker)
[15:15:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:55] Data-Engineering-Planning, Shared-Data-Infrastructure, Superset: Error when displaying box plots in Superset - https://phabricator.wikimedia.org/T328047 (BTullis) The latest version of `numpy` used in the upstream requirements is [[https://github.com/apache/superset/blob/master/requirements/base.txt#...
[16:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:33] (PS1) Mazevedo: Add legacy schema MobileWikiAppiOSReadingLists to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885835 (https://phabricator.wikimedia.org/T328487)
[16:04:23] (CR) CI reject: [V: -1] Add legacy schema MobileWikiAppiOSReadingLists to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885835 (https://phabricator.wikimedia.org/T328487) (owner: Mazevedo)
[16:07:59] (PS1) Btullis: Build new version of superset with pinned numpy version [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/885836 (https://phabricator.wikimedia.org/T328047)
[16:13:49] (PS1) Mazevedo: Add required fields to wikipedia_ios_app fragment to avoid repetition [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885837 (https://phabricator.wikimedia.org/T328487)
[16:15:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:15] Data-Engineering-Planning, Shared-Data-Infrastructure, Superset, Patch-For-Review: Error when displaying box plots in Superset - https://phabricator.wikimedia.org/T328047 (BTullis) I have made a test deployment of this version to superset-next. The box plot appears to be working in this version....
[16:19:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:20:56] (PS15) Snwachukwu: Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769)
[16:26:25] Data-Engineering-Planning: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (HXi-WMF) Hi! Sorry just wanted to check in on this again. I am able to ssh, but when I try to sign into Jupyter notebook with my shell username (xihua) and LDAP password (Wikitech password), it te...
[16:29:55] (CR) Tsevener: Add MobileWikiAppiOSUserHistory to MEP (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885004 (https://phabricator.wikimedia.org/T328312) (owner: Mazevedo)
[16:34:06] Data-Engineering-Planning, Machine-Learning-Team, Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (elukey) Open→Resolved a:elukey I am going to close this task sin...
[16:36:42] Data-Engineering-Planning, Data Pipelines (Sprint 07), Product-Analytics (Kanban): Include EU Registered Country in the canonical country database - https://phabricator.wikimedia.org/T324995 (mforns) @nshahquinn-wmf I saw you merged the changes, thank you! :-) Please, would it be possible for you to...
[16:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:00] Data-Engineering-Planning: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (BTullis) >>! In T328457#8578608, @Ottomata wrote: >> all members of the DE team as sql_lab users > > Maybe make all members of the DE team as Admin users? Done. Everyone from t...
[16:49:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:05] Data-Engineering-Planning, Shared-Data-Infrastructure, Superset, Patch-For-Review: Error when displaying box plots in Superset - https://phabricator.wikimedia.org/T328047 (Pablo) @BTullis I confirm that box plots appear in this version. That was quick, thank you very much!
[17:11:30] Data-Engineering-Planning: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (kzimmerman) Thanks @BTullis , I'll ask my team to check and we'll reach out if we need anything else!
[17:13:44] btullis: the load seems to have gone down a bit; ready to rerun the refinery-deploy-to-hdfs
[17:27:27] steve_munene: feel free
[17:30:55] Data-Engineering, Research-Backlog: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (Ladsgroup) I'm sorry if I'm stating the obvious but this really looks like [[https://en.wikipedia.org/wiki/Anomaly_detection|a classic case of anomaly detection]] which the...
[17:30:57] last run results still had an exception https://phabricator.wikimedia.org/P43556 btullis : joal
[17:31:47] that's really weird - it's the first time I see those :(
[17:37:12] (CR) Tsevener: [C: -1] Add required fields to wikipedia_ios_app fragment to avoid repetition (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885837 (https://phabricator.wikimedia.org/T328487) (owner: Mazevedo)
[17:53:21] (PS1) Mazevedo: Add required fields to wikipedia_ios_app fragment to avoid repetition [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885849 (https://phabricator.wikimedia.org/T328487)
[17:56:59] (Abandoned) Mazevedo: Add required fields to wikipedia_ios_app fragment to avoid repetition [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885837 (https://phabricator.wikimedia.org/T328487) (owner: Mazevedo)
[18:00:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:03:48] (CR) Tsevener: [C: +2] Add required fields to wikipedia_ios_app fragment to avoid repetition [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885849 (https://phabricator.wikimedia.org/T328487) (owner: Mazevedo)
[18:04:30] (Merged) jenkins-bot: Add required fields to wikipedia_ios_app fragment to avoid repetition [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885849 (https://phabricator.wikimedia.org/T328487) (owner: Mazevedo)
[18:04:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:08:17] (PS14) Mazevedo: Add MobileWikiAppiOSUserHistory to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885004 (https://phabricator.wikimedia.org/T328312)
[18:23:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1075%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:28:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1075%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:08:23] Data-Engineering-Planning, API Platform, AQS2.0: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (JArguello-WMF) Open→Resolved
[19:15:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:19:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:13] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0: Geo Analytics Service - https://phabricator.wikimedia.org/T288305 (JArguello-WMF)
[19:22:52] Data-Engineering-Planning: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (mpopov) > when I try to sign into Jupyter notebook with my shell username (xihua) and LDAP password (Wikitech password), it tells me username and password not valid @EChetty: Hua is providing th...
[19:28:40] Data-Engineering, API Platform (Sprint 04), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF)
[19:30:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:05] Data-Engineering-Planning, API Platform, AQS2.0: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (JArguello-WMF)
[19:31:32] Analytics, API Platform, AQS2.0, Code-Health-Objective: Synchronize .gitignore files - https://phabricator.wikimedia.org/T315113 (JArguello-WMF) Open→Resolved
[19:32:14] Data-Engineering-Planning, Shared-Data-Infrastructure: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (EChetty)
[19:33:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:33:47] Data-Engineering, Equity-Landscape: Add country_meta_data - https://phabricator.wikimedia.org/T324681 (JAnstee_WMF)
[19:34:26] Data-Engineering, Equity-Landscape: Add country_meta_data - https://phabricator.wikimedia.org/T324681 (JAnstee_WMF)
[19:39:53] Data-Engineering-Planning, Shared-Data-Infrastructure: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (EChetty) p:Triage→High
[19:42:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp5021 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5021%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:47:13] (VarnishkafkaNoMessages) resolved: varnishkafka on cp5021 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp5021%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:57:24] (PS15) Mazevedo: Add MobileWikiAppiOSUserHistory to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885004 (https://phabricator.wikimedia.org/T328312)
[19:59:11] (CR) Mazevedo: "Updated!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885004 (https://phabricator.wikimedia.org/T328312) (owner: Mazevedo)
[20:01:08] (PS2) Mazevedo: Add legacy schema MobileWikiAppiOSReadingLists to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885835 (https://phabricator.wikimedia.org/T328487)
[20:01:36] (CR) CI reject: [V: -1] Add legacy schema MobileWikiAppiOSReadingLists to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885835 (https://phabricator.wikimedia.org/T328487) (owner: Mazevedo)
[20:10:52] Data-Engineering-Planning, Data Pipelines, Infrastructure-Foundations, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (EChetty) Hey @CDanis! Sorry to respond super late - The team and I have been trying...
[20:17:53] (PS3) Mazevedo: Add legacy schema MobileWikiAppiOSReadingLists to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885835 (https://phabricator.wikimedia.org/T328487)
[20:29:10] Data-Engineering-Planning, Data Pipelines (Sprint 07), Product-Analytics (Kanban): Include EU Registered Country in the canonical country database - https://phabricator.wikimedia.org/T324995 (nshahquinn-wmf) @mforns of course! I just deployed it. Leaving this open in case @EChetty needs to sign off.
[20:31:56] (PS1) Ottomata: Finalize mediawiki/page/change schema at 1.0.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/885877 (https://phabricator.wikimedia.org/T308017)
[20:33:04] (PS2) Ottomata: Finalize mediawiki/page/change schema at 1.0.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/885877 (https://phabricator.wikimedia.org/T308017)
[20:38:13] (CR) Ottomata: [C: +2] Finalize mediawiki/page/change schema at 1.0.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/885877 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[20:41:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp3064 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp3064%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:44:37] (CR) Tsevener: Add MobileWikiAppiOSUserHistory to MEP (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885004 (https://phabricator.wikimedia.org/T328312) (owner: Mazevedo)
[20:45:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp3064 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp3064%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:48:25] (PS16) Mazevedo: Add MobileWikiAppiOSUserHistory to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885004 (https://phabricator.wikimedia.org/T328312)
[20:48:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:49:53] (PS17) Mazevedo: Add MobileWikiAppiOSUserHistory to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885004 (https://phabricator.wikimedia.org/T328312)
[20:51:04] (PS18) Mazevedo: Add MobileWikiAppiOSUserHistory to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885004 (https://phabricator.wikimedia.org/T328312)
[20:55:25] Data-Engineering-Planning, Data Pipelines, Infrastructure-Foundations, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (Ottomata) I think the best path forward right now is: 1. 4 new ganeti instances in...
[21:00:14] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Wikipedia-iOS-App-Backlog, and 5 others: Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage - https://phabricator.wikimedia.org/T328127 (SNowick_WMF) a:SNowick_WMF
[21:06:35] Data-Engineering, Event-Platform Value Stream (Sprint 08), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) I've merged the schema out of the development namespace at `/mediawiki/page/change/1.0.0`....
[21:08:42] (CR) Mazevedo: Add MobileWikiAppiOSUserHistory to MEP (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885004 (https://phabricator.wikimedia.org/T328312) (owner: Mazevedo)
[21:12:24] (PS1) Ottomata: Remove development/ mediawiki page change schemas [schemas/event/primary] - https://gerrit.wikimedia.org/r/885886 (https://phabricator.wikimedia.org/T308017)
[21:13:34] Data-Engineering, API Platform (Sprint 04), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF) @SGupta-WMF Surbhi, please confirm that Bill's analysis is correct and, if so, create bug tickets...
[21:14:20] Data-Engineering, API Platform (Sprint 04), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF) a:EChukwukere-WMF→SGupta-WMF
[21:15:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:07] Data-Engineering-Planning, Data Pipelines, Infrastructure-Foundations, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (CDanis) Thanks! I think a Ganeti VM would be fine. Can I ask, what's the issue wi...
[21:20:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:27:44] ottomata: thanks for getting back to me <3 just one question for you re: other k8s clusters (assuming that the DSE cluster was the issue and not mirrormaker-on-k8s in general?)
[21:48:07] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (TheresNoTime) FTR, I [[ https://meta.wikimedia.org/wiki/Special:Redirect/logid...
[21:59:51] Data-Engineering-Planning, Shared-Data-Infrastructure: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (BTullis) Hello @HXi-WMF I'm sorry that you're having such trouble with this part of the process. I will do my best to help you to get this sorted as soon as possi...
[22:03:16] Data-Engineering-Planning, Shared-Data-Infrastructure: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (BTullis) Here is the LDAP request ticket: {T328607}
[22:15:14] Data-Engineering-Planning, Shared-Data-Infrastructure: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (BTullis) I have now added you to the `wmf` group in LDAP, so please would you try again to access JupyterHub @HXi-WMF. Here is confirmation of that by means of an LD...
[22:24:14] Data-Engineering-Planning, Shared-Data-Infrastructure: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (mpopov) Thank you so much @BTullis for looking into it and performing the necessary fixes. I also appreciate the detailed notes from your investigation.
[22:30:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:35:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:40:38] Data-Engineering-Planning, Product-Analytics, Wmfdata-Python: Wmfdata-Python's CSV loading cannot handle standard quoted CSV values - https://phabricator.wikimedia.org/T327983 (nshahquinn-wmf) As @mforns [reminded me](https://github.com/wikimedia-research/canonical-data/pull/3#issuecomment-1411103219...
[22:50:13] Data-Engineering-Planning, Shared-Data-Infrastructure, Superset, Patch-For-Review: Error when displaying box plots in Superset - https://phabricator.wikimedia.org/T328047 (BTullis) Great. Thanks for the confirmation. I'll try to deploy to production tomorrow.
[22:55:49] (PS1) Mazevedo: Add iOS schema MobileWikiAppiOSSearch to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885904 (https://phabricator.wikimedia.org/T328604)
[22:56:19] (CR) CI reject: [V: -1] Add iOS schema MobileWikiAppiOSSearch to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885904 (https://phabricator.wikimedia.org/T328604) (owner: Mazevedo)
[22:58:49] (PS2) Mazevedo: Add iOS schema MobileWikiAppiOSSearch to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/885904 (https://phabricator.wikimedia.org/T328604)
[23:04:45] Data-Engineering, Research-Backlog: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (leila) @Ladsgroup indeed this is by no means an unsolvable problem at the theoretical level (and there is an existing solution for it in place). However, depending on the s...
[23:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:20:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:23] Data-Engineering, API Platform (Sprint 04), AQS2.0, Code-Health-Objective, and 2 others: Dashboards for AQS 2.0 - https://phabricator.wikimedia.org/T288667 (JArguello-WMF) a:BPirkle
[23:37:12] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Product-Analytics, and 2 others: Analyze possible bot traffic for frwiki article Cookie (informatique) - https://phabricator.wikimedia.org/T313114 (kzimmerman) Wanted to note a couple of things I had explored earlier: - I searched p...