[00:15:00] PROBLEM - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:27:15] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.7; 2022-04-11), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (dom_walden) @Zabe On my local machine, running `php maintenance/update.php`: `...
[08:39:07] !log powercycle an-worker1094 - OEM event registered in `racadm getsel`, host frozen
[08:39:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:15:37] Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (BTullis)
[09:47:00] uh oh, sqoop failed?
[09:54:47] milimetric: yep.. on etwiki apparently
[09:54:58] checking vgutierrez, thanks
[09:56:05] milimetric: Let me know if there's anything I can do to help.
[09:56:33] it's probably a schema change, btullis, I'll need to make a patch before the kids wake up or hand it off to Jo
[10:02:50] btullis: looks like we need to run maintain_views, I'll ping Amir to see what happened, maybe we need a change in the views too
[10:02:55] "java.sql.SQLException: View 'etwiki_p.revision' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them"
[10:14:25] ok, asked in persistence. Our sqoop job is stuck for 3 days at least. What should we do with the alert, btullis? Is there a way to acknowledge without resetting the job?
[10:14:45] I'll reply to the email so others know
[10:15:35] team, FYI: sqoop jobs are blocked on big alter table changes in production. Will be stuck for at least a few days.
[10:16:50] Yep, I'll ack in Icinga, which will leave it at critical. Have we a ticket number to refer to anywhere, or do we not need one?
[10:18:17] btullis: yep: https://phabricator.wikimedia.org/T298560
[10:19:17] ACKNOWLEDGEMENT - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-whole-mediawiki Btullis T298560 - our sqoop operation is currently blocked on an alter table operation https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:18:02] milimetric: eqi an-launcher1002
[11:18:05] woops
[11:20:54] hm?
[11:22:45] sorry mate - I was reading/investigating the sqoop thing
[11:22:52] Thanks a lot for the report!
[11:28:20] milimetric: I dropped a line on analytics slack channel for others to know about the sqoop issue
[11:28:57] Oh, I thought I had too, sorry
[13:25:34] Analytics: Large number of web requests from Iran are likely incorrectly flagged with 'user' agent type - https://phabricator.wikimedia.org/T309710 (Samwalton9)
[13:28:17] Analytics: Large number of web requests from Iran are likely incorrectly flagged with 'user' agent type - https://phabricator.wikimedia.org/T309710 (Samwalton9)
[13:29:21] Analytics: Large number of web requests from Iran are likely incorrectly flagged with 'user' agent type - https://phabricator.wikimedia.org/T309710 (Samwalton9)
[13:56:28] oh! analytics, not data-engineering... my morning brain was stuck in an old timeline, spacetime coordinates refreshed :)
[14:14:32] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.505 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[14:30:24] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.1749 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[14:33:51] Data-Engineering: Event Utilities partially downloads schemas - https://phabricator.wikimedia.org/T309717 (Milimetric)
[14:40:59] Hello, I forgot my kerberos password which I originally got from this provisioning process (https://phabricator.wikimedia.org/T304361), I have opened a request in ( https://phabricator.wikimedia.org/T309608 ) with the tag "Data-Engineering" as suggested in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide#Reset_my_password_for_Kerberos. Does anyone know if the
[14:41:01] request is being processed?
[14:41:19] Data-Engineering, Airflow: [Airflow] Migrate Oozie's mediawiki_history_load jobs to Airflow - https://phabricator.wikimedia.org/T309718 (mforns)
[14:44:09] Data-Engineering: Kerberos password reset request for sgimeno - https://phabricator.wikimedia.org/T309608 (Ottomata) Hi! I have reset your password. You should receive an email with instructions.
[14:44:19] Hi sergi0 , thanks for the ping. I have reset your password
[14:45:11] ottomata: thank you very much!
[14:52:55] Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (ntsako) Open→In progress
[14:52:57] Data-Engineering, Equity-Landscape: Extract + Transformation Raw Data into Input Metrics - https://phabricator.wikimedia.org/T306625 (ntsako)
[15:28:17] Data-Engineering: Event Utilities partially downloads schemas - https://phabricator.wikimedia.org/T309717 (Ottomata) Ah okay so the success you see is for the eqiad partition, which is different than the codfw one, which is failing. And indeed, line 214 in [[ https://schema.wikimedia.org/repositories//secon...
[15:43:31] Data-Engineering, Data-Engineering-Kanban: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (BTullis) a:BTullis
[15:47:38] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform, Patch-For-Review: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate - https://phabricator.wikimedia.org/T308356 (Ottomata)
[15:58:26] Analytics, Data-Engineering, SRE: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (BTullis) I like the look of this task, so I'm going to claim it if noone minds. Predictably enough, I think that we should use MirrorMaker 2 and run it...
[15:58:37] Data-Engineering, Airflow: [Airflow] Migrate Oozie's mediawiki_history_load jobs to Airflow - https://phabricator.wikimedia.org/T309718 (NOkafor-WMF) a:NOkafor-WMF
[16:40:40] Analytics, Product-Analytics, SDAW-MediaSearch, Structured-Data-Backlog (Current Work): [M] No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (CBogen)
[16:44:51] Analytics, Product-Analytics, SDAW-MediaSearch, Structured-Data-Backlog (Current Work): [M] No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (CBogen) Estimated as a M assuming this is a config we need to find and turn on somewher...
[16:46:41] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (BTullis)
[16:54:23] Data-Engineering: Kerberos password reset request for sgimeno - https://phabricator.wikimedia.org/T309608 (Sgs) Open→Resolved a:Sgs It worked. Thank you!
[17:57:38] (PS2) Snwachukwu: [WIP] Add projectview hql scripts to analytics/refinery/hql path. [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023)
[18:05:09] (PS4) Snwachukwu: Add HQL scripts for wikidata graphite metrics [analytics/refinery] - https://gerrit.wikimedia.org/r/791394 (https://phabricator.wikimedia.org/T300021)
[18:06:32] (CR) Snwachukwu: Add HQL scripts for wikidata graphite metrics (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/791394 (https://phabricator.wikimedia.org/T300021) (owner: Snwachukwu)
[18:13:47] Analytics, Data-Engineering, Data-Engineering-Kanban, Event-Platform, and 3 others: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (Krinkle)
[18:24:43] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/791394 (https://phabricator.wikimedia.org/T300021) (owner: Snwachukwu)
[18:25:13] (CR) Snwachukwu: [C: +1] Add HQL scripts for wikidata graphite metrics [analytics/refinery] - https://gerrit.wikimedia.org/r/791394 (https://phabricator.wikimedia.org/T300021) (owner: Snwachukwu)
[18:32:13] hello btullis or ottomata can you please add SandraEbele to https://gerrit.wikimedia.org/r/admin/groups/d34747bee94be39cff54b5fda1ae36b575107792,members. I think that's the group she needs to be to be able to +2 our repos on Gerrit, no?
[18:43:47] Thank you @mforns
[18:51:01] !log About to deploy analytics/refinery (regular weekly train)
[18:51:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:51:31] Data-Engineering-Kanban, Data-Catalog: Spike: Evaluate datahub schema versioning support - https://phabricator.wikimedia.org/T307716 (Milimetric)
[18:51:42] Data-Engineering-Kanban, Data-Catalog: Spike: Evaluate datahub schema versioning support - https://phabricator.wikimedia.org/T307716 (Milimetric)
[18:53:48] milimetric and ottomata: q about airflow and kerberos keytab. When testing SimpleSkeinOperator from the development instance, we need to pass the principal and keytab, but a regular user does not have a keytab. Does this mean that we can not test SimpleSkeinOperator with our usernames?
[18:56:55] Data-Engineering-Kanban, Data-Catalog: Spike: Evaluate interaction of manual description edits and automatic description reimport - https://phabricator.wikimedia.org/T307717 (EChetty)
[19:03:25] Data-Engineering: Remove unused Gerrit repository - https://phabricator.wikimedia.org/T309731 (Eevans)
[19:06:09] Data-Engineering, Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (Eevans)
[19:18:36] Analytics, Data-Engineering, SRE: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (Ottomata) > Predictably enough, I think that we should use MirrorMaker 2 and run it in k8s on the wikikube clusters :-) This would be awesome. I'd be r...
[19:21:02] Data-Engineering, Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (Eevans) Since the purpose of this ticket was to ensure that no existing clients were suddenly caught unawares by the addition of a data center that hadn't been there, and...
[19:21:41] Data-Engineering, Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (Eevans)
[19:22:45] Data-Engineering-Radar, Cassandra, Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[19:22:51] Data-Engineering, Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (Eevans) Open→Resolved a:Eevans
[19:23:12] Data-Engineering-Radar, Cassandra, Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[19:25:40] Data-Engineering-Kanban, Data-Catalog: User Experience: Authentication - https://phabricator.wikimedia.org/T307711 (Milimetric) Emil is still having a problem authenticating. When he logs in, his username doesn't have the groups that I add for user `echetty`.
[19:25:49] mforns: it depends on what the SimpleSkeinOperator does.
[19:26:11] you do not need to pass a keytab if the script it is going to run does not need to use kerberos
[19:26:22] if it does, then it has to somehow authenticate with kerberos on the worker
[19:26:28] hence, keytab
[19:26:38] so yeah you can't use your username if you need to pass a keytab
[19:27:02] but anyone in analytics-privatedata-users can sudo -u analytics-privatedata and use the analytics-privatedata keytab and principal
[19:27:44] example of an airflow tasks test command I used to do this:
[19:28:02] AIRFLOW_HOME=$HOME/airflow-analytics-privatedata HOME=$AIRFLOW_HOME PYTHONPATH=/home/otto/airflow-dags sudo --preserve-env=AIRFLOW_HOME,PYTHONPATH,HOME -u analytics-privatedata kerberos-run-command analytics-privatedata /home/otto/.conda/envs/airflow_development/bin/airflow tasks test spark_conda_test_dag pyspark_tester_local_unpacked_env_skein 2022-01-01
[19:28:36] and then keytab and principal are set to
[19:28:38] keytab = '/etc/security/keytabs/analytics-privatedata/analytics-privatedata.keytab',
[19:28:38] principal = 'analytics-privatedata/stat1004.eqiad.wmnet@WIKIMEDIA',
[19:28:48] (principal changes depending on where you are launching from)
[19:29:08] ottomata: no I think I was distracted, we do have the analytics-privatedata keytab, and that one can be used with our usernames, my bad
[19:29:43] mforns: i don't seem to have permissions to edit that group membership in gerrit
[19:30:00] ottomata: is it a puppet thing?
[19:30:06] no i don't think so
[19:30:09] maybe a gerrit admin thing?
[19:30:15] maybe
[19:31:10] mforns: https://phabricator.wikimedia.org/T261443#6481437
[19:31:56] mforns: i think you can't access the keytab unless you sudo -u analytics-privatedata
[19:32:26] ottomata: oh, of course, but still it would be read-only right?
[19:32:46] yes
[19:32:52] 👍
[19:33:38] hmm, i wonder if we can/should make it readable by folks in analytics-privatedata-users without sudoing...
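(Editor's note, not part of the original log.) The keytab and principal settings quoted at 19:28:38, together with the remark that the principal changes with the launching host, could be sketched in Python roughly as below. The helper name `privatedata_kerberos_kwargs` is hypothetical, and the `user/fqdn@REALM` pattern is an assumption inferred from the single stat1004 example in the chat; the keytab path is copied from the log.

```python
import socket

# Keytab path as quoted in the chat at 19:28:38.
KEYTAB = "/etc/security/keytabs/analytics-privatedata/analytics-privatedata.keytab"


def privatedata_kerberos_kwargs(fqdn=None, user="analytics-privatedata", realm="WIKIMEDIA"):
    """Hypothetical helper: build keytab/principal values for the launching host.

    Assumes the principal follows the pattern user/fqdn@REALM, which is
    inferred from the one example in the log and not confirmed elsewhere.
    """
    if fqdn is None:
        # Fall back to the host the code is launched from.
        fqdn = socket.getfqdn()
    return {"keytab": KEYTAB, "principal": f"{user}/{fqdn}@{realm}"}


# Reproduces the values quoted in the chat for stat1004.
kwargs = privatedata_kerberos_kwargs("stat1004.eqiad.wmnet")
print(kwargs["principal"])
```

As noted in the conversation, the keytab is only readable after `sudo -u analytics-privatedata`, so values like these would only be usable by a process running as that user.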
[19:33:49] i'm not sure
[19:35:17] hm
[19:35:52] milimetric: I left some more comments on https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/63
[19:36:41] I think the way I describe might be the solution we need. Let me know what you think. Sorry for the config confusion so far!
[19:42:23] Analytics, Data-Engineering, SRE: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (CDanis) >>! In T304373#7974286, @BTullis wrote: > I like the look of this task, so I'm going to claim it if noone minds. Please go right ahead! I am h...
[19:50:22] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/801416 (owner: Joal)
[19:51:04] (CR) Mforns: [V: +2 C: +2] "Didn't submit, since there's a refinery deployment going on right now." [analytics/refinery] - https://gerrit.wikimedia.org/r/801416 (owner: Joal)
[20:04:00] mforns: commented on your comment :)
[20:09:10] !log Successfully deployed refinery using scap, then deployed onto hdfs.
[20:09:12] Data-Engineering-Radar, Cassandra, Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans) Regarding replication strategy, the current state looks like the following: ` CREATE KEYSPACE "local_group_default_T_pageviews_per_project_v2" WITH...
[20:09:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:37:19] ottomata: commented on your comment to my comment :]
[20:54:05] thanks both of you for the comments, I'll try to get back to them later, just got out of all day meetings
[21:02:38] Data-Engineering: Move Mediawiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738 (Milimetric)
[21:04:02] !log trying to rerun sqoop from a screen on an-launcher
[21:04:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:04:20] ^ Amir said this should work now, we'll monitor over the next few days and try something else if it doesn't
[21:06:23] if anyone else wants to keep track of it, `milimetric@an-launcher1002:~$ tail -f /var/log/refinery/sqoop-mediawiki.log`
[21:06:29] o/ nite yall
[22:08:44] Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.15; 2022-06-06), and 3 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Zabe) >>! In T233004#7972925, @dom_walden wrote: > @Zabe On my local machine,...