[00:36:53] RECOVERY - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:57:26] Analytics-Radar, SRE, Traffic, WMF-General-or-Unknown, Performance-Team (Radar): Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (AntiCompositeNumber) It's not just `/static`, JavaScript and CSS...
[07:00:59] wow thanks elukey for the downtime :S
[07:43:39] Analytics, MediaWiki-REST-API, Platform Engineering, Platform Team Workboards (Green), Story: System administrator reviews API usage by client - https://phabricator.wikimedia.org/T251812 (Aklapper) Adding #Platform_Engineering as #cpt-prod-green was archived and as open tasks should have an a...
[08:26:10] Oh dear. Thanks elukey.
[08:28:05] Two of the cassandra file systems have hit 100% after the latest snapshot loading operation completed.
[08:32:14] (CR) Phuedx: "See inline. Hopefully my comment makes sense (and uses the correct verbiage!)" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[08:32:44] (PS1) DLynch: talk_page_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076)
[08:33:20] (CR) jerkins-bot: [V: -1] talk_page_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[08:48:13] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) Two of the volumes have hit 100% of capacity after the most recent loading operation completed.
Here is the final o...
[08:54:35] We need to get these remaining snapshots off the new AQS hosts, on to some other hosts where we can still load them to cassandra.
[08:55:57] elukey: What do you think about using the labstore100[67] hosts? These hosts have 18 TB free in /srv/dumps - Is that an option?
[08:57:07] btullis: no idea, I think that those nodes have all data exposed to the public and it may not be the best place (but wmcs is the best poc, they own the systems)
[08:58:15] OK, thanks. All oozie jobs to load cassandra data are currently failing and I can't restart these services until I can work out where to put at least 2 x 1.3 TB of data that is on them. I'll ask in #wikimedia-cloud.
[09:00:36] btullis: there is the option of the stat100x nodes, plenty of space in /srv
[09:02:46] Ah, that's a good idea. Thanks.
[09:07:36] I was also about to suggest presto nodes - plenty of space not used
[09:08:56] yep even better
[09:09:04] an-presto1001 has 22T of space free :D
[09:09:10] Ah, also good thinking. Ah, it's formatted and mounted as well. Thanks joal:
[09:10:35] I can fit the remaining 6 snapshots in there, but I'll just move the first two and then restart the service and clean up the failed jobs first.
[09:12:08] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) I am going to copy the following two directories to an-presto1001.eqiad.wmnet: * `aqs1012.eqiad.wmnet:/srv/cassand...
[09:16:40] !log btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/cassandra_migration/aqs1012-b/
[09:16:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:17:41] !log btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1012-b/
[09:17:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:18:25] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) I have begun the first transfer. ` btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/l...
[09:24:01] aqs1012 is network saturated sending the snapshot, but an-presto1001 has plenty of bandwidth and disk I/O available, so I'll start a second concurrent transfer.
[09:25:14] !log btullis@cumin1001:~$ sudo transfer.py aqs1013.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1013-b/
[09:25:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:26:14] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) Running a second, concurrent transfer operation: ` btullis@cumin1001:~$ sudo transfer.py aqs1013.eqiad.wmnet:/srv/c...
[09:48:56] heya btullis - let me know when you think the services are happy enough for me to restart loading (I'll do it, I wish to trick and not reload cassandra2, only cassandra3)
[09:49:19] Thanks joal: Will do.
[09:51:34] I'll delete the source files and start the services as soon as the two transfers have completed and validated, then let you know.
[09:51:51] ack btullis - Don't we need the other transfers as well
[09:51:53] ?
[09:55:16] We will need all remaining snapshots, but not before restarting the services. I think that we only need to free up space from the two full volumes.
[09:55:48] In fact, I could have simply deleted aqs1012-b because this snapshot completed loading successfully.
[09:58:51] My current issue is working out how to continue loading to cassandra from presto1001. By my reading the current firewall rules should allow this, but I can't connect to the cql port at the moment.
[09:58:55] https://www.irccloud.com/pastebin/XZmEKVwF/
[10:00:41] meh :(
[10:19:17] Analytics-Radar, Data-Engineering, Event-Platform, SRE: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (jbond) >>! In T291905#7431136, @elukey wrote: > To recap the next steps: > * Add the cfssl CA cert to the base truststore of all jvms...
[10:20:10] Analytics-Radar, Data-Engineering, Event-Platform, SRE: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (jbond) >>! In T291905#7431157, @Joe wrote: > For the record, we've created a `wmf-certificates` debian package that includes the pupp...
[10:24:12] I am a blithering idiot. I was running that test against the very instance that was not running. This works perfectly.
[10:24:15] https://www.irccloud.com/pastebin/4FI6k3aG/
[10:24:24] Ah great :)
[10:24:48] And let me disagree on you saying you're an idiot :)
[10:26:51] (PS1) Lucas Werkmeister (WMDE): Check that change dispatch statistics are present [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/731351 (https://phabricator.wikimedia.org/T293329)
[10:26:52] Too kind :-) So the next issue is simply how to get the cassandra binaries onto an-presto1001 in a non-invasive manner. I can probably extract the .deb package into my home and try that...
[10:27:31] I wonder if nodetool wouldn't have a package of its own
[10:33:45] (CR) Lucas Werkmeister (WMDE): Check that change dispatch statistics are present (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/731351 (https://phabricator.wikimedia.org/T293329) (owner: Lucas Werkmeister (WMDE))
[10:41:22] Unfortunately not. Both nodetool and sstableloader are part of the cassandra package.
[10:41:26] https://www.irccloud.com/pastebin/6T1IE8Le/
[10:47:19] Arf, ok :(
[11:15:20] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) I have extracted the cassandra package to `/home/btullis/cassandra/` on an-presto1001 using the commands: ` mkdir c...
[12:06:27] Analytics-Radar, Data-Engineering, Event-Platform, SRE, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (jbond) >>! In T291905#7435523, @jbond wrote: >>>! In T291905#7431136, @elukey wrote: >> To recap the next steps...
[12:06:33] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) The two transfers have completed: ` 2021-10-18 09:16:27 ERROR: The specified target path /srv/cassandra_migration/...
[12:09:25] !log root@aqs1012:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
[12:09:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:09:52] !log root@aqs1013:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
[12:09:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:10:14] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:10:31] btullis: have you restarted jobs for cassandra lately?
[12:10:56] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:10:58] No, not the jobs. I've *just* restarted the two services that were not running.
[12:11:12] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:11:14] ack btullis - that's weird, there still are loading jobs ongoing!
[12:11:25] I was leaving the jobs for you.
[12:11:28] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:11:35] yeah that's good btullis :)
[12:11:36] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:11:46] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:14:03] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) ` root@aqs1012:/srv/cassandra-b/tmp# rm -rf local_group_default_T_* ` ` root@aqs1013:/srv/cassandra-b/tmp# rm -rf l...
[12:15:32] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) However, an error is shown in the logs for aqs1013-b: ` ERROR [main] 2021-10-18 12:11:03,867 LogTransaction.java:49...
[12:27:21] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) The documentation that is referenced says this: > New transaction log files have been introduced to replace the...
[12:47:21] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) I have run the command provided, but all I can get is the same error message. ` root@aqs1013:/srv/cassandra-b/data/...
[13:22:34] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) I had to remove all four of the REMOVE lines above, before the service would start successfully on aqs1013. i.e. th...
[13:25:47] joal: That's all of the cassandra service instances running now, although I had to edit transaction logs manually in order to get aqs1013-b to start. :-(
[13:26:03] What's the status of the loading jobs at the moment?
[13:26:22] btullis: thank you a lot for the detailed follow up on ticket - I was reading as you were providing them
[13:27:01] btullis: I have not restarted anything on loading - We're late but nothing huge - just one day
[13:31:03] Great. Did you find out about the loading jobs that were still ongoing?
[13:44:18] nope btullis :(
[13:47:05] wow - I don't know how btullis, but the loading-job for per-article-flat actually succeeded after you restarted the services! It's been waiting, not failing
[13:47:15] this is kind of unexpected
[13:47:48] Oh, interesting.
[14:11:14] Analytics-Radar, SRE, Patch-For-Review, User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey)
[14:11:38] Analytics-Radar, Event-Platform, SRE, Platform Team Initiatives (Modern Event Platform (TEC2)), User-herron: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (elukey)
[14:12:27] Analytics-Radar, SRE, Patch-For-Review, User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) Open→Resolved
[14:17:50] (CR) Ottomata: Add new scroll schema.
(7 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[14:43:49] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:45:02] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:46:42] java.lang.OutOfMemoryError: Java heap space :(
[14:47:17] some of the logs mention https://yarn.wikimedia.org/cluster/app/application_1633985963344_29442
[14:48:42] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:49:14] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:49:25] !log restart hadoop-yarn-nodemanager on an-worker1119 and an-worker1103 (Java OOM in the logs)
[14:49:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:50:27] Analytics, Epic: Add ability to compare wikis - https://phabricator.wikimedia.org/T283251 (odimitrijevic)
[14:53:50] (PS1) GoranSMilovanovic: T259105 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/731745
[14:54:09] (CR) GoranSMilovanovic: [V: +2 C: +2] T259105 [analytics/wmde/WD/WikidataAnalytics] -
https://gerrit.wikimedia.org/r/731745 (owner: GoranSMilovanovic)
[15:09:26] Analytics, Analytics-Dashiki, Analytics-Kanban, Data-Engineering, Developer-Advocacy (Oct-Dec 2021): https://wmcs-edits.wmflabs.org/ not showing time series data since 2020-12-31 - https://phabricator.wikimedia.org/T292871 (Milimetric) Checked this morning and data is showing up, so all good....
[15:10:08] Analytics, Machine-Learning-Team, ORES, editquality-modeling, artificial-intelligence: Generate dump of scored-revisions from 2018-2020 for Wikis except English Wikipedia - https://phabricator.wikimedia.org/T280107 (JAllemandou)
[15:10:44] Analytics, Analytics-Kanban: Create monthly job for canonical pageviews - https://phabricator.wikimedia.org/T265732 (JAllemandou) a:JAllemandou
[15:12:58] any idea if gitlab.wikimedia.org requests end up in the webrequest logs?
[15:14:17] Analytics, Analytics-Dashiki, Analytics-Kanban, Data-Engineering, Developer-Advocacy (Oct-Dec 2021): https://wmcs-edits.wmflabs.org/ not showing time series data since 2020-12-31 - https://phabricator.wikimedia.org/T292871 (Milimetric) Open→Resolved
[15:17:14] !log Rerun failed instances from cassandra-hourly-coord-local_group_default_T_pageviews_per_project_v2
[15:17:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:18:02] addshore: if it goes through varnish, then yes...but i think maybe gitlab is hosted in cloud vps? not sure.
[15:19:15] oh, perhaps i'm wrong, it is in prod
[15:19:15] ?
[15:19:34] it's wikimedia.org, so it's not in wmcs
[15:20:32] ottomata: gitlab is on the ganeti hosts
[15:21:17] huh, still not 100%, but it looks like gitlab hosts have public IPs and are routed to directly via .wikimedia.org???
[15:24:10] aah right, so no logs, okayys!
[15:24:16] at least, not via webrequest currently
[15:24:21] ottomata: as far as I can see
[15:26:05] yeah, dunno why it is done like that, that's a little weird
[15:27:59] Analytics-Clusters, Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (Ottomata)
[15:40:58] Analytics, Data-Engineering, Event-Platform, Wikidata, and 3 others: Add MCR slot information to revision-create events - https://phabricator.wikimedia.org/T293195 (dcausse) @Cparle I remember you worked on MCR slot filtering on RecentChanges, please let us know if you have suggestions on this ap...
[15:58:12] Analytics-Radar, Product-Analytics, wmfdata-python: Consider rewriting wmfdata-python to use omniduct - https://phabricator.wikimedia.org/T275038 (nshahquinn-wmf) p:Medium→Low
[16:13:07] (PS2) DLynch: talk_page_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076)
[16:13:40] (CR) DLynch: "I had to fix a bug in jsonschema-tools to work that validation error out. https://github.com/wikimedia/jsonschema-tools/pull/38" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[16:16:39] !log Rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_referer-2021-10-17
[16:16:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:22:17] !log rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-10-17
[16:22:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:30:28] (CR) Phuedx: Add new scroll schema.
(1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[16:31:48] Analytics, Analytics-Kanban, Data-Engineering, SRE Observability (FY2021/2022-Q2), User-fgiunchedi: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (odimitrijevic) p:Triage→High
[16:32:50] Analytics, Data-Engineering: Allow users to differentiate their JupyterHub logs in Logstash - https://phabricator.wikimedia.org/T293243 (odimitrijevic) p:Triage→Medium
[16:34:22] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:20] Analytics, Data-Engineering, Event-Platform, Wikidata, and 3 others: Add MCR slot information to revision-create events - https://phabricator.wikimedia.org/T293195 (odimitrijevic) ftr MCR stands for Multi Content Revisions.
[16:38:52] (CR) Phuedx: Add new scroll schema.
(1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[16:42:11] Analytics, Data-Engineering, Event-Platform, Wikidata, and 3 others: Add MCR slot information to revision-create events - https://phabricator.wikimedia.org/T293195 (odimitrijevic) p:Triage→High
[16:42:26] Analytics: Kerberos identity for kcv-wikimf - https://phabricator.wikimedia.org/T293189 (odimitrijevic) p:Triage→High
[16:43:57] Analytics: Kerberos identity for kcv-wikimf - https://phabricator.wikimedia.org/T293189 (Milimetric) a:razzi ping @razzi to run the script to add this
[16:44:13] Analytics: Presto error in Superset - https://phabricator.wikimedia.org/T292879 (odimitrijevic) p:Triage→High a:razzi
[16:45:39] btullis: were you the one working on jaime's username / presto access problems, I think that's what that last issue is about ^ (and I remember someone discussing potential capitalization problems last week)
[16:45:59] Analytics, Analytics-Kanban: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699 (odimitrijevic) p:Triage→High
[16:46:54] Analytics, Analytics-Kanban, Data-Engineering: Un-fork analytics/gobblin - https://phabricator.wikimedia.org/T292396 (odimitrijevic) p:Triage→Medium
[16:47:12] Analytics, Analytics-Kanban, Data-Engineering: Un-fork analytics/gobblin - https://phabricator.wikimedia.org/T292396 (odimitrijevic) a:JAllemandou
[16:49:00] Analytics, Analytics-Kanban, SRE, wmfdata-python, Product-Analytics (Kanban): wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (odimitrijevic) p:Triage→High
[16:49:45] Analytics, Analytics-Kanban: Automate kerberos credential creation and management to ease the creation of testing infrastructure -
https://phabricator.wikimedia.org/T292389 (odimitrijevic) p:Triage→High
[16:50:59] Analytics: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (odimitrijevic) p:Triage→Medium
[16:52:42] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:57:58] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:01:09] this seems to be again https://yarn.wikimedia.org/cluster/app/application_1633985963344_29442
[17:01:34] joal: --^ (not sure what's happening but I see that spark job in the logs before network errors and OOMs)
[17:06:08] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:09:39] elukey: thanks for the ping - dsaez this job is yours
[17:10:58] actually, it was
[17:23:24] thanks joal... I don't know what was happening there, it was a very simple query, I've restarted the notebook and everything went smoothly
[17:23:38] mwarf dsaez - weird :S
[17:23:45] thanks for restarting :)
[17:35:14] ottomata: are https://docker-registry.wikimedia.org/wikimedia/eventgate-wikimedia/tags/ tags just commit hashes?
[17:36:02] addshore: i'm not sure, they might be gerrit change ids? they are created as part of the deployment pipeline
[17:36:12] ack! will dig into that bit then!
[17:36:24] ah, i think they are git commit shas
[17:36:26] example
[17:36:27] https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/722613
[17:36:38] See PipelineBot comment
[17:39:23] joal, by the way we have a new contractor, user:mnz, she will be working on moving the section alignment code to spark. She is still learning about our cluster and pyspark, just fyi
[17:39:55] ack dsaez thanks for the heads-up :)
[18:04:33] (CR) Ottomata: Add new scroll schema. (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[18:05:10] Analytics: Check home/HDFS leftovers of tonina - https://phabricator.wikimedia.org/T293676 (MoritzMuehlenhoff)
[18:08:14] Analytics: Check home/HDFS leftovers of tonina - https://phabricator.wikimedia.org/T293676 (MoritzMuehlenhoff) Point of contact for any data which might possibly need to be retained is @WMDE-leszek
[18:14:37] ottomata: another eventlogging question, right now in order to receive all events in a dev setup, does one need both legacy event logging and event gate? or only eventgate?
[18:15:11] 'all events'?
[18:15:39] I see things in mw logs like this for example
[18:15:40] [EventStreamConfig] Stream 'mediawiki.revision-create' does not match any `stream` in stream config
[18:15:48] would that show up in eventgate if I configured it more?
[18:16:10] I also see [EventLogging] wgEventLoggingBaseUri has not been configured., but that only relates to the old one, so not sure if it is needed or if i should ignore the log message
[18:16:11] ah, hm, do you have EventBus set up/installed?
[18:16:15] yup
[18:16:21] and I see events i fire through JS
[18:16:24] that is what is producing that
[18:17:00] i think eventgate-wikimedia should allow any event without considering stream config, if stream_config_uri is not set
[18:17:12] so, is that revision-create stream config message just a warning?
[18:17:13] or info?
[18:18:06] just an info (I think) but tldr I can see events i make through JS by hand, but i dont see this revision create event in event gate logs
[18:18:16] hm
[18:18:25] eventgate seems to say `No stream_config_uri was set; events of any $schema will be allowed in any stream.` which looks right
[18:19:06] For the php side of things I have $wgEventServices = ['*' => [ 'url' => 'http://eventlogging:8192/v1/events' ],];
[18:19:20] oh great
[18:19:21] seems right
[18:19:23] now i realize maybe its just the php side that is broken, as JS does direct to the services over http
[18:19:36] cause the JS side uses $wgEventLoggingServiceUri i guess
[18:19:47] if you see that message about rev-create from eventgate, that means that eventgate is at least seeing the event
[18:20:08] so I see that in eventgate php code (the extension) but not in the service
[18:20:23] eventgate php code (the extension) ?
[18:20:23] you mean eventbus?
[18:20:43] yes sorry, `[EventBus]`
[18:20:47] oh ok
[18:21:26] yeah sounds like an eventbus config issue then...
[18:21:27] hm
[18:21:44] is mw php able to reach eventgate on http://eventlogging:8192?
[18:21:50] essentially i am following https://www.mediawiki.org/wiki/MediaWiki-Docker/Configuration_recipes/EventLogging#LocalSettings.php
[18:22:04] yes, it can see http://eventlogging:8192
[18:22:10] i personally have never used eventgate with docker...
[18:22:33] that does seem to indicate that it would work
[18:22:55] right, ill dig around in eventbus config and see if anything pops up
[18:23:07] k good luck, thanks
[18:29:48] I even see EventBusHooks deferred updates firing `[DeferredUpdates] DeferredUpdates::run: ended MWCallableUpdate_MediaWiki\Extension\EventBus\EventBusHooks::sendRevisionCreateEvent #657`
[18:30:17] oh maybe eventbus consults stream config too!
[18:30:24] `[EventStreamConfig] Stream 'mediawiki.page-move' does not match any `stream` in stream config`
[18:30:29] `[EventBus] Using EventServiceDefault * for stream mediawiki.page-move. destination_event_service is not configured.`
[18:30:52] *searches for stream config docs*
[18:31:26] hmm
[18:31:27] $wgEventServiceDefault = 'eventgate';
[18:31:36] i don't know how the *
[18:31:37] works
[18:31:41] in wgEventServices
[18:31:45] what if you set
[18:31:48] $wgEventServiceDefault = '*';
[18:31:49] ?
[18:31:56] lemme give it a go
[18:32:17] from EventBus README
[18:32:19] Per stream configuration via EventStreamConfig is optional. The default behavior is to
[18:32:19] produce all streams to the service specified by `$wgEventServiceDefault`.
[18:32:19] You must set `$wgEventServiceDefault` to the an entry in `$wgEventServices` to be
[18:32:19] used in case a stream's `destination_event_service` setting is not provided.
[18:33:05] aaah, so i already have `$wgEventServiceDefault = '*';`
[18:33:09] hm
[18:33:27] I dont get the `destination_event_service is not configured` bit?
[18:33:54] oh right, so `destination_event_service is not configured` means that the default is used, and in theory i have that configured
[18:34:11] // Use eventServiceDefault if no streamConfigs were provided.
[18:34:11] if ( $this->streamConfigs === null ) {
[18:34:24] do you have anything set in wgEventStreams ?
[18:35:12] no :/
[18:35:51] can you add some debug logging in EventBus/ServiceWiring.php and see what is happening for $streamConfigs?
[18:35:57] yes!
[18:36:04] I dunno how you could get that message without wgEventStreams being set
[18:36:30] !log Rerun cassandra-daily-wf-local_group_default_T_unique_devices-2021-10-17
[18:36:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:36:37] from reading EventBusFactory::getInstanceForStream
[18:37:14] https://usercontent.irccloud-cdn.com/file/cPAT7mYu/image.png
[18:38:05] wait, so that means I do have some stream configs? but they are empty? or? cause that sure isnt null
[18:39:04] I'm wondering why I see so much `[EventLogging] wgEventLoggingBaseUri has not been configured.` though
[18:39:17] I guess for the new thing that shouldnt be needed, but maybe it is just noise
[18:39:56] i think that might just be noise
[18:40:05] but yes, that does seem like somehow it is an empty array1
[18:40:06] 1
[18:40:08] !
[18:40:15] what if you set $wgEventStreams = null
[18:40:16] ?
[18:40:28] oh hmm
[18:40:41] hangon, with you shortly, doing a config deploy and testing some things...
[18:41:06] `[6c3202b9ed478e8182b1be59] /w/index.php?title=User:Saffssaffsasss&action=submit Wikimedia\Assert\ParameterTypeException: Bad value for parameter EventStreams: must be a array`
[18:41:07] np
[18:41:10] i'll keep poking
[18:41:52] i think maybe eventbus has a bug
[18:42:40] or
[18:42:48] $streamConfigs = $services->get( 'EventStreamConfig.StreamConfigs' );
[18:42:48] should be returning null
[18:42:55] if wgEventStreams is not defined
[18:43:07] OH
[18:43:08] if ( ExtensionRegistry::getInstance()->isLoaded( 'EventStreamConfig' ) ) {
[18:43:08] I tried setting $streamConfigs to null after those conditions and still no joy
[18:43:14] its because EventStreamConfig is loaded
[18:43:26] * addshore tries turning that odd
[18:43:27] off
[18:43:55] you want if ( $this->streamConfigs === null ) {
[18:43:55] on line 165 of EventBusFactory to be true
[18:46:26] right
[18:46:54] i commented out a bunch of stuff, and then i see the event https://usercontent.irccloud-cdn.com/file/IEajIMYC/image.png
[18:48:56] its the very first condition there `if ( !$this->shouldSendEvent( $type ) ) {`
[18:51:13] https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/master/extension.json#L19-L21
[18:52:05] oh ho
[18:52:05] hm
[18:52:19] i think that might be something that Petr wanted to remove?
[18:52:41] then, can you set that to TYPE_EVENT or TYPE_ALL
[18:52:42] ?
[18:53:27] ya, setting to TYPE_ALL makes it all work
[18:53:28] sweet
[18:53:29] thanks!
[18:54:31] cool!
[18:54:50] i should have noticed that, i'm looking at mw vagrant setup and it has that
[18:54:57] https://www.irccloud.com/pastebin/oaKOp9x6/
[18:55:09] once i merge this mwcli will have an out of the box eventlogging setup now :)
[18:55:20] cool
[18:55:36] it is hard to know where to draw the line, what producers should be enabled by default
[18:55:42] yup
[18:55:45] do you want all instrumentation events?
[18:55:47] all mw state change events?
[18:56:05] all api log requests?
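For anyone hitting the same wall in a MediaWiki-Docker dev setup, the outcome of the debugging session above can be summarised as a LocalSettings.php fragment. This is only a sketch pieced together from the messages in this log: the `$wgEventServices` and `$wgEventServiceDefault` values are the ones quoted in the conversation, while the name `$wgEnableEventBus` is an assumption inferred from the linked extension.json lines and should be checked against the EventBus extension before relying on it.

```php
<?php
// LocalSettings.php fragment for a local dev wiki (sketch, not a verified recipe).

wfLoadExtension( 'EventBus' );

// Quoted in the conversation: a single catch-all entry pointing at the local
// eventgate container.
$wgEventServices = [
	'*' => [ 'url' => 'http://eventlogging:8192/v1/events' ],
];

// Per the EventBus README excerpt above: used when a stream has no
// destination_event_service, and must name an entry in $wgEventServices.
$wgEventServiceDefault = '*';

// ASSUMPTION: the extension.json setting referenced above is EnableEventBus.
// The fix reported in the log was changing it from its default to TYPE_ALL
// (TYPE_EVENT was mentioned as an alternative).
$wgEnableEventBus = 'TYPE_ALL';
```

With a fragment like this, the JS-side events (which already worked) and the PHP-side EventBus events such as revision-create should both reach the same eventgate instance, which matches what was observed once TYPE_ALL was set.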
[19:10:04] (CR) Ppchelko: "Looks OK to me, a couple of bike sheddy comments inlined. But I'm not The Expert in MCR, so I added Daniel as well. He's on vacation unfor" [schemas/event/primary] - https://gerrit.wikimedia.org/r/731006 (https://phabricator.wikimedia.org/T293195) (owner: DCausse) [19:29:42] !log Rerun cassandra-daily-wf-local_group_default_T_top_pageviews-2021-10-17 [19:29:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:42:08] ebernhardson: yt? i see you wrote WgConfTestCase 6 years ago in mediawiki-config [19:42:27] i'm very confused about how StaticSiteConfiguration is supposed to work [19:42:38] trying to do https://phabricator.wikimedia.org/T277193 [19:42:48] thought we had it all worked out [19:42:51] but [19:42:52] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731804 [19:42:57] causes https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/8614/console [19:43:01] which is not what I expected [19:43:13] and I can't quite test locally... [19:43:16] maybe i need to figure that out [20:16:20] oh doh, well i dunno how to test locally, but my diff was bad because i was dumb [20:16:23] it works as expected! [20:16:30] the -labs overrides don't though [20:31:17] Analytics, Analytics-Kanban, Product-Analytics, SRE, wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (nshahquinn-wmf) I think I can handle this just by using an absolute reference to the [file in refinery](https://github.co... [20:32:50] Analytics, Analytics-Kanban, Product-Analytics, SRE, wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (nshahquinn-wmf) p:High→Medium It seems like the priority isn't //that// high since there's a pretty easy workarou... 
[20:33:23] Analytics, Analytics-Kanban, Event-Platform, Metrics-Platform, and 2 others: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata) IT WORKS! https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConf... [20:43:12] milimetric: > were you the one working on jaime's username / presto access problems? [20:43:12] No, I think that was Razzi who was working on it. Happy to take a look if you think it might help. [20:43:40] ah, ok, np btullis, I pinged razzi on the task [20:53:45] Analytics, Product-Analytics, wmfdata-python: Upstream relevant parts of wmfdata-python into refinery - https://phabricator.wikimedia.org/T293700 (nshahquinn-wmf) [20:55:18] Analytics, Analytics-Kanban, Event-Platform, Metrics-Platform, and 2 others: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata) The merging in beta only sort of works. 'default' is not merged, so you can only override setti... [20:56:20] Analytics, Product-Analytics, wmfdata-python: Upstream relevant parts of wmfdata-python into refinery - https://phabricator.wikimedia.org/T293700 (nshahquinn-wmf) [20:58:14] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) In summary: * We have loaded 6 snapshots out of 12 * We have copied 1 of these remaining snapshots to an-presto100... [21:01:28] Analytics, Product-Analytics, wmfdata-python: Upstream relevant parts of wmfdata-python into refinery - https://phabricator.wikimedia.org/T293700 (Ottomata) I think we'd like to make the python stuff that refinery does now be able to use conda environments. If we can do that, it would probably be be... 
[21:05:41] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) ` btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-a/tmp/local_group_default_T_pageviews_pe... [21:09:35] (CR) Clare Ming: Add new scroll schema. (6 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming) [21:17:02] (CR) Clare Ming: Add new scroll schema. (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming) [22:09:18] Analytics-Radar, Anti-Harassment, CheckUser, Privacy Engineering, and 3 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (Niharika) [22:24:36] Analytics, Patch-For-Review: Kerberos identity for kcv-wikimf - https://phabricator.wikimedia.org/T293189 (razzi) Open→Resolved Should be all set. Check your email kcvelaga-ctr@wikimedia.org for further instructions. 
[22:25:11] Ah, just seeing your earlier conversation milimetric [22:32:28] Analytics, SRE-Access-Requests, Patch-For-Review: Kerberos identity for kcv-wikimf - https://phabricator.wikimedia.org/T293189 (Dzahn) [22:37:48] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Upgrade Superset to 1.3.1 or higher - https://phabricator.wikimedia.org/T288115 (razzi) Open→In progress [22:58:47] Analytics-Radar, Anti-Harassment, CheckUser, Privacy Engineering, and 3 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (Niharika) [22:59:30] Analytics-Radar, Anti-Harassment, CheckUser, Privacy Engineering, and 3 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (Niharika) I have updated the task to reflect the latest timelines as published by the Google Chrome team. [23:24:56] (PS3) DLynch: talk_page_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076)