[00:36:53] <icinga-wm>	 RECOVERY - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:57:26] <wikibugs>	 Analytics-Radar, SRE, Traffic, WMF-General-or-Unknown, Performance-Team (Radar): Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests - https://phabricator.wikimedia.org/T261803 (AntiCompositeNumber) It's not just `/static`, JavaScript and CSS...
[07:00:59] <joal>	 wow thanks elukey for the downtime :S
[07:43:39] <wikibugs>	 Analytics, MediaWiki-REST-API, Platform Engineering, Platform Team Workboards (Green), Story: System administrator reviews API usage by client - https://phabricator.wikimedia.org/T251812 (Aklapper) Adding #Platform_Engineering as #cpt-prod-green was archived and as open tasks should have an a...
[08:26:10] <btullis>	 Oh dear. Thanks elukey.
[08:28:05] <btullis>	 Two of the cassandra file systems have hit 100% after the latest snapshot loading operation had completed.
[08:32:14] <wikibugs>	 (CR) Phuedx: "See inline. Hopefully my comment makes sense (and uses the correct verbiage!)" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[08:32:44] <wikibugs>	 (PS1) DLynch: talk_page_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076)
[08:33:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] talk_page_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[08:48:13] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) Two of the volumes have hit 100% of capacity after the most recent loading operation completed. Here is the final o...
[08:54:35] <btullis>	 We need to get these remaining snapshots off the new AQS hosts, on to some other hosts where we can still load them to cassandra. 
[08:55:57] <btullis>	 elukey: What do you think about using the labstore100[67] hosts? These hosts have 18 TB free in /srv/dumps - is that an option?
[08:57:07] <elukey>	 btullis: no idea, I think that those nodes have all data exposed to the public and it may not be the best place (but wmcs is the best poc, they own the systems)
[08:58:15] <btullis>	 OK, thanks. All oozie jobs to load cassandra data are currently failing and I can't restart these services until I can work out where to put at least 2 x 1.3 TB of data that is on them. I'll ask in #wikimedia-cloud.
[09:00:36] <elukey>	 btullis: there is the option of the stat100x nodes, plenty of space in /srv
[09:02:46] <btullis>	 Ah, that's a good idea. Thanks.
[09:07:36] <joal>	 I was also about to suggest presto nodes - plenty of space not used
[09:08:56] <elukey>	 yep even better
[09:09:04] <elukey>	 an-presto1001 has 22T of space free :D
[09:09:10] <btullis>	 Ah, also good thinking. Ah, it's formatted and mounted as well. Thanks joal: 
[09:10:35] <btullis>	 I can fit the remaining 6 snapshots in there, but I'll just move the first two and then restart the service and clean up the failed jobs first.
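Editor's sketch: the capacity reasoning in this exchange (22T free on an-presto1001, snapshots of roughly 1.3 TB each) can be checked with a few lines of Python. This is a minimal illustration, not a tool used in the channel. The helper names `free_bytes` and `fits`, the 10% headroom, and the assumption that every remaining snapshot is about 1.3 TB are all hypothetical.

```python
import shutil
from typing import Sequence

TB = 10**12  # decimal terabyte, used only for this illustration


def free_bytes(path: str) -> int:
    """Return the free space, in bytes, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free


def fits(free: int, snapshot_sizes: Sequence[int], headroom: float = 0.1) -> bool:
    """True if all snapshots fit while leaving `headroom` (a fraction) of the free space unused."""
    return sum(snapshot_sizes) <= free * (1 - headroom)


# Hypothetical usage on the target host (path taken from the transfer commands):
#   fits(free_bytes("/srv"), [int(1.3 * TB)] * 6)
```

Keeping some headroom is a design choice: the volumes in this incident failed precisely because they reached 100%.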
[09:12:08] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) I am going to copy the following two directories to an-presto1001.eqiad.wmnet:  * `aqs1012.eqiad.wmnet:/srv/cassand...
[09:16:40] <btullis>	 !log btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/cassandra_migration/aqs1012-b/
[09:16:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:17:41] <btullis>	 !log btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1012-b/
[09:17:43] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:18:25] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) I have begun the first transfer. ` btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/l...
[09:24:01] <btullis>	 aqs1012 is network saturated sending the snapshot, but an-presto1001 has plenty of bandwidth and disk I/O available, so I'll start a second concurrent transfer.
[09:25:14] <btullis>	 !log btullis@cumin1001:~$ sudo transfer.py aqs1013.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1013-b/
[09:25:17] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:26:14] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) Running a second, concurrent transfer operation: ` btullis@cumin1001:~$ sudo transfer.py aqs1013.eqiad.wmnet:/srv/c...
[09:48:56] <joal>	 heya btullis - let me know when you think the services are happy enough for me to restart loading (I'll do it, I wish to trick and not reload cassandra2, only cassandra3)
[09:49:19] <btullis>	 Thanks joal: Will do.
[09:51:34] <btullis>	 I'll delete the source files and start the services as soon as the two transfers have completed and been validated, then let you know.
[09:51:51] <joal>	 ack btullis - Don't we need the other transfers as well
[09:51:53] <joal>	 ?
[09:55:16] <btullis>	 We will need all remaining snapshots, but not before restarting the services. I think that we only need to free up space from the two full volumes. 
[09:55:48] <btullis>	 In fact, I could have simply deleted aqs1012-b because this snapshot completed loading successfully. 
[09:58:51] <btullis>	 My current issue is working out how to continue loading to cassandra from presto1001. By my reading the current firewall rules should allow this, but I can't connect to the cql port at the moment.
[09:58:55] <btullis>	 https://www.irccloud.com/pastebin/XZmEKVwF/
[10:00:41] <joal>	 meh :(
[10:19:17] <wikibugs>	 Analytics-Radar, Data-Engineering, Event-Platform, SRE: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (jbond) >>! In T291905#7431136, @elukey wrote: > To recap the next steps: > * Add the cfssl CA cert to the base truststore of all jvms...
[10:20:10] <wikibugs>	 Analytics-Radar, Data-Engineering, Event-Platform, SRE: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (jbond) >>! In T291905#7431157, @Joe wrote: > For the record, we've created a `wmf-certificates` debian package that includes the pupp...
[10:24:12] <btullis>	 I am a blithering idiot. I was running that test against the very instance that was not running. This works perfectly.
[10:24:15] <btullis>	 https://www.irccloud.com/pastebin/4FI6k3aG/
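Editor's sketch: the failed test above was pointed at an instance that was not running. A minimal TCP reachability check, run against each instance in turn, makes that kind of mix-up easier to spot. Assumptions: the function name `port_open` is hypothetical, the host names in the usage comment are placeholders, and 9042 is Cassandra's default CQL native transport port, which the log does not confirm for these hosts.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Hypothetical usage (placeholder instance names, default CQL port assumed):
#   for host in ("instance-a.example", "instance-b.example"):
#       print(host, port_open(host, 9042))
```

A successful connect only shows that something is listening on the port; it does not prove that the firewall rules or the CQL service are correct end to end.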
[10:24:24] <joal>	 Ah great :)
[10:24:48] <joal>	 And let me disagree on you saying you're an idiot :)
[10:26:51] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Check that change dispatch statistics are present [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/731351 (https://phabricator.wikimedia.org/T293329)
[10:26:52] <btullis>	 Too kind :-) So the next issue is simply how to get the cassandra binaries onto an-presto1001 in a non-invasive manner. I can probably extract the .deb package into my home and try that... 
[10:27:31] <joal>	 I wonder if nodetool wouldn't have a package of its own
[10:33:45] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Check that change dispatch statistics are present (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/731351 (https://phabricator.wikimedia.org/T293329) (owner: Lucas Werkmeister (WMDE))
[10:41:22] <btullis>	 Unfortunately not. Both nodetool and sstableloader are part of the cassandra package.
[10:41:26] <btullis>	 https://www.irccloud.com/pastebin/6T1IE8Le/
[10:47:19] <joal>	 Arf, ok :(
[11:15:20] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) I have extracted the cassandra package to `/home/btullis/cassandra/` on an-presto1001 using the commands: ` mkdir c...
[12:06:27] <wikibugs>	 Analytics-Radar, Data-Engineering, Event-Platform, SRE, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (jbond) >>! In T291905#7435523, @jbond wrote: >>>! In T291905#7431136, @elukey wrote: >> To recap the next steps...
[12:06:33] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) The two transfers have completed: ` 2021-10-18 09:16:27  ERROR: The specified target path /srv/cassandra_migration/...
[12:09:25] <btullis>	 !log root@aqs1012:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
[12:09:27] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:09:52] <btullis>	 !log root@aqs1013:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
[12:09:54] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:10:14] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:10:31] <joal>	 btullis: have you restarted jobs for cassandra lately?
[12:10:56] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:10:58] <btullis>	 No, not the jobs. I've *just* restarted the two services that were not running.
[12:11:12] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:11:14] <joal>	 ack btullis - that's weird, there still are loading jobs ongoing!
[12:11:25] <btullis>	 I was leaving the jobs for you.
[12:11:28] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:11:35] <joal>	 yeah that's good btullis :)
[12:11:36] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:11:46] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:14:03] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) ` root@aqs1012:/srv/cassandra-b/tmp# rm -rf local_group_default_T_* ` ` root@aqs1013:/srv/cassandra-b/tmp# rm -rf l...
[12:15:32] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) However, an error is shown in the logs for aqs1013-b: ` ERROR [main] 2021-10-18 12:11:03,867 LogTransaction.java:49...
[12:27:21] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) The documentation that is referenced says this:    > New transaction log files have been introduced to replace the...
[12:47:21] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) I have run the command provided, but all I can get is the same error message. ` root@aqs1013:/srv/cassandra-b/data/...
[13:22:34] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) I had to remove all four of the REMOVE lines above, before the service would start successfully on aqs1013. i.e. th...
[13:25:47] <btullis>	 joal: That's all of the cassandra service instances running now, although I had to edit transaction logs manually in order to get aqs1013-b to start. :-(
[13:26:03] <btullis>	 What's the status of the loading jobs at the moment?
[13:26:22] <joal>	 btullis: thank you a lot for the detailed follow up on ticket - I was reading as you were providing them
[13:27:01] <joal>	 btullis: I have not restarted anything on loading - We're late but nothing huge - just one day
[13:31:03] <btullis>	 Great. Did you find out about the loading jobs that were still ongoing?
[13:44:18] <joal>	 nope btullis :(
[13:47:05] <joal>	 wow - I don't know how btullis, but the loading-job for per-article-flat actually succeeded after you restarted the services! It's been waiting, not failing
[13:47:15] <joal>	 this is kind of unexpected
[13:47:48] <btullis>	 Oh, interesting.
[14:11:14] <wikibugs>	 Analytics-Radar, SRE, Patch-For-Review, User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey)
[14:11:38] <wikibugs>	 Analytics-Radar, Event-Platform, SRE, Platform Team Initiatives (Modern Event Platform (TEC2)), User-herron: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (elukey)
[14:12:27] <wikibugs>	 Analytics-Radar, SRE, Patch-For-Review, User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) Open→Resolved
[14:17:50] <wikibugs>	 (CR) Ottomata: Add new scroll schema. (7 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[14:43:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:45:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:46:42] <elukey>	 java.lang.OutOfMemoryError: Java heap space :(
[14:47:17] <elukey>	 some of the logs mention https://yarn.wikimedia.org/cluster/app/application_1633985963344_29442
[14:48:42] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:49:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:49:25] <elukey>	 !log restart hadoop-yarn-nodemanager on an-worker1119 and an-worker1103 (Java OOM in the logs)
[14:49:27] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:50:27] <wikibugs>	 Analytics, Epic: Add ability to compare wikis - https://phabricator.wikimedia.org/T283251 (odimitrijevic)
[14:53:50] <wikibugs>	 (PS1) GoranSMilovanovic: T259105 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/731745
[14:54:09] <wikibugs>	 (CR) GoranSMilovanovic: [V: +2 C: +2] T259105 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/731745 (owner: GoranSMilovanovic)
[15:09:26] <wikibugs>	 Analytics, Analytics-Dashiki, Analytics-Kanban, Data-Engineering, Developer-Advocacy (Oct-Dec 2021): https://wmcs-edits.wmflabs.org/ not showing time series data since 2020-12-31 - https://phabricator.wikimedia.org/T292871 (Milimetric) Checked this morning and data is showing up, so all good....
[15:10:08] <wikibugs>	 Analytics, Machine-Learning-Team, ORES, editquality-modeling, artificial-intelligence: Generate dump of scored-revisions from 2018-2020 for Wikis except English Wikipedia - https://phabricator.wikimedia.org/T280107 (JAllemandou)
[15:10:44] <wikibugs>	 Analytics, Analytics-Kanban: Create monthly job for canonical pageviews - https://phabricator.wikimedia.org/T265732 (JAllemandou) a:JAllemandou
[15:12:58] <addshore>	 any idea if gitlab.wikimedia.org requests end up in the webrequest logs?
[15:14:17] <wikibugs>	 Analytics, Analytics-Dashiki, Analytics-Kanban, Data-Engineering, Developer-Advocacy (Oct-Dec 2021): https://wmcs-edits.wmflabs.org/ not showing time series data since 2020-12-31 - https://phabricator.wikimedia.org/T292871 (Milimetric) Open→Resolved
[15:17:14] <joal>	 !log Rerun failed instances from cassandra-hourly-coord-local_group_default_T_pageviews_per_project_v2
[15:17:16] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:18:02] <ottomata>	 addshore: if it goes through varnish, then yes... but i think maybe gitlab is hosted in cloud vps?  not sure.
[15:19:15] <ottomata>	 oh, perhaps i'm wrong, it is in prod
[15:19:15] <ottomata>	 ?
[15:19:34] <majavah>	 it's wikimedia.org, so it's not in wmcs
[15:20:32] <Spookreeeno>	 ottomata: gitlab is on the ganeti hosts
[15:21:17] <ottomata>	 huh, still not 100%, but it looks like gitlab hosts have public IPs and are routed to directly via .wikimedia.org???
[15:24:10] <addshore>	 aah right, so no logs, okayys!
[15:24:16] <addshore>	 at least, not via webrequest currently
[15:24:21] <Spookreeeno>	 ottomata: as far as I can see
[15:26:05] <ottomata>	 yeah, dunno why it is done like that, that's a little weird
[15:27:59] <wikibugs>	 Analytics-Clusters, Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (Ottomata)
[15:40:58] <wikibugs>	 Analytics, Data-Engineering, Event-Platform, Wikidata, and 3 others: Add MCR slot information to revision-create events - https://phabricator.wikimedia.org/T293195 (dcausse) @Cparle I remember you worked on MCR slot filtering on RecentChanges, please let us know if you have suggestions on this ap...
[15:58:12] <wikibugs>	 Analytics-Radar, Product-Analytics, wmfdata-python: Consider rewriting wmfdata-python to use omniduct - https://phabricator.wikimedia.org/T275038 (nshahquinn-wmf) p:Medium→Low
[16:13:07] <wikibugs>	 (PS2) DLynch: talk_page_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076)
[16:13:40] <wikibugs>	 (CR) DLynch: "I had to fix a bug in jsonschema-tools to work that validation error out. https://github.com/wikimedia/jsonschema-tools/pull/38" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[16:16:39] <joal>	 !log Rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_referer-2021-10-17
[16:16:41] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:22:17] <joal>	 !log rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-10-17
[16:22:19] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:30:28] <wikibugs>	 (CR) Phuedx: Add new scroll schema. (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[16:31:48] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering, SRE Observability (FY2021/2022-Q2), User-fgiunchedi: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (odimitrijevic) p:Triage→High
[16:32:50] <wikibugs>	 Analytics, Data-Engineering: Allow users to differentiate their JupyterHub logs in Logstash - https://phabricator.wikimedia.org/T293243 (odimitrijevic) p:Triage→Medium
[16:34:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:35:20] <wikibugs>	 Analytics, Data-Engineering, Event-Platform, Wikidata, and 3 others: Add MCR slot information to revision-create events - https://phabricator.wikimedia.org/T293195 (odimitrijevic) ftr MCR stands for Multi Content Revisions.
[16:38:52] <wikibugs>	 (CR) Phuedx: Add new scroll schema. (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[16:42:11] <wikibugs>	 Analytics, Data-Engineering, Event-Platform, Wikidata, and 3 others: Add MCR slot information to revision-create events - https://phabricator.wikimedia.org/T293195 (odimitrijevic) p:Triage→High
[16:42:26] <wikibugs>	 Analytics: Kerberos identity for kcv-wikimf - https://phabricator.wikimedia.org/T293189 (odimitrijevic) p:Triage→High
[16:43:57] <wikibugs>	 Analytics: Kerberos identity for kcv-wikimf - https://phabricator.wikimedia.org/T293189 (Milimetric) a:razzi ping @razzi to run the script to add this
[16:44:13] <wikibugs>	 Analytics: Presto error in Superset - https://phabricator.wikimedia.org/T292879 (odimitrijevic) p:Triage→High a:razzi
[16:45:39] <milimetric>	 btullis: were you the one working on jaime's username / presto access problems, I think that's what that last issue is about ^ (and I remember someone discussing potential capitalization problems last week)
[16:45:59] <wikibugs>	 Analytics, Analytics-Kanban: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699 (odimitrijevic) p:Triage→High
[16:46:54] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Un-fork analytics/gobblin - https://phabricator.wikimedia.org/T292396 (odimitrijevic) p:Triage→Medium
[16:47:12] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Un-fork analytics/gobblin - https://phabricator.wikimedia.org/T292396 (odimitrijevic) a:JAllemandou
[16:49:00] <wikibugs>	 Analytics, Analytics-Kanban, SRE, wmfdata-python, Product-Analytics (Kanban): wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (odimitrijevic) p:Triage→High
[16:49:45] <wikibugs>	 Analytics, Analytics-Kanban: Automate kerberos credential creation and management to ease the creation of testing infrastructure - https://phabricator.wikimedia.org/T292389 (odimitrijevic) p:Triage→High
[16:50:59] <wikibugs>	 Analytics: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (odimitrijevic) p:Triage→Medium
[16:52:42] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:57:58] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:01:09] <elukey>	 this seems to be again https://yarn.wikimedia.org/cluster/app/application_1633985963344_29442
[17:01:34] <elukey>	 joal: --^ (not sure what's happening but I see that spark job in the logs before network errors and OOMs)
[17:06:08] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:09:39] <joal>	 elukey: thanks for the ping - dsaez this job is yours
[17:10:58] <joal>	 actually, it was
[17:23:24] <dsaez>	 thanks joal... I don't know what was happening there, it was a very simple query. I've restarted the notebook and everything went smoothly
[17:23:38] <joal>	 mwarf dsaez - weird :S
[17:23:45] <joal>	 thanks for restarting :)
[17:35:14] <addshore>	 ottomata: are https://docker-registry.wikimedia.org/wikimedia/eventgate-wikimedia/tags/ tags just commit hashes?
[17:36:02] <ottomata>	 addshore:  i'm not sure, they might be gerrit change ids?  they are created as part of the deployment pipeline
[17:36:12] <addshore>	 ack! will dig into that bit then!
[17:36:24] <ottomata>	 ah, i think they are git commit shas
[17:36:26] <ottomata>	 example
[17:36:27] <ottomata>	 https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/722613
[17:36:38] <ottomata>	 See PipelineBot comment
[17:39:23] <dsaez>	 joal, by the way we have a new contractor, user:mnz, she will be working on moving the section alignment code to spark. She is still learning about our cluster and pyspark, just fyi
[17:39:55] <joal>	 ack dsaez thanks for the heads-up :)
[18:04:33] <wikibugs>	 (CR) Ottomata: Add new scroll schema. (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[18:05:10] <wikibugs>	 Analytics: Check home/HDFS leftovers of tonina - https://phabricator.wikimedia.org/T293676 (MoritzMuehlenhoff)
[18:08:14] <wikibugs>	 Analytics: Check home/HDFS leftovers of tonina - https://phabricator.wikimedia.org/T293676 (MoritzMuehlenhoff) Point of contact for any data which might possibly need to be retained is @WMDE-leszek
[18:14:37] <addshore>	 ottomata: another eventlogging question, right now in order to receive all events in a dev setup, does one need both legacy event logging and event gate? or only eventgate?
[18:15:11] <ottomata>	 'all events'?
[18:15:39] <addshore>	 I see things in mw logs like this for example
[18:15:40] <addshore>	 [EventStreamConfig] Stream 'mediawiki.revision-create' does not match any `stream` in stream config
[18:15:48] <addshore>	 would that show up in eventgate if I configured it more?
[18:16:10] <addshore>	 I also see [EventLogging] wgEventLoggingBaseUri has not been configured., but that only relates to the old one, so not sure if it is needed or if i should ignore the log message
[18:16:11] <ottomata>	 ah, hm, do you have EventBus set up/installed?
[18:16:15] <addshore>	 yup
[18:16:21] <addshore>	 and I see events i fire through JS
[18:16:24] <ottomata>	 that is what is producing that
[18:17:00] <ottomata>	 i think eventgate-wikimedia should allow any event without considering stream config, if stream_config_uri is not set
[18:17:12] <ottomata>	 so, is that revision-create stream config message just a warning?
[18:17:13] <ottomata>	 or info?
[18:18:06] <addshore>	 just an info (I think) but tldr I can see events i make through JS by hand, but i dont see this revision create event in eventgate logs
[18:18:16] <ottomata>	 hm
[18:18:25] <addshore>	 eventgate seems to say `No stream_config_uri was set; events of any $schema will be allowed in any stream.` which looks right
[18:19:06] <addshore>	 For the php side of things I have $wgEventServices = ['*' => [ 'url' => 'http://eventlogging:8192/v1/events' ],];
[18:19:20] <ottomata>	 oh great
[18:19:21] <ottomata>	 seems right
[18:19:23] <addshore>	 now i realize maybe its just the php side that is broken, as JS does direct to the services over http
[18:19:36] <addshore>	 cause the JS side uses $wgEventLoggingServiceUri i guess
[18:19:47] <ottomata>	 if you see that message about rev-create from eventgate, that means that eventgate is at least seeing the event
[18:20:08] <addshore>	 so I see that in eventgate php code (the extension) but not in the service
[18:20:23] <ottomata>	 eventgate php code (the extension)  ?
[18:20:23] <ottomata>	 you mean eventbus?
[18:20:43] <addshore>	 yes sorry, `[EventBus]`
[18:20:47] <ottomata>	 oh ok
[18:21:26] <ottomata>	 yeah sounds like an eventbus config issue then...
[18:21:27] <ottomata>	 hm
[18:21:44] <ottomata>	 is mw php able to reach eventgate on http://eventlogging:8192?
[18:21:50] <addshore>	 essentially i am following https://www.mediawiki.org/wiki/MediaWiki-Docker/Configuration_recipes/EventLogging#LocalSettings.php
[18:22:04] <addshore>	 yes, it can see http://eventlogging:8192
[18:22:10] <ottomata>	 i personally have never used eventgate with docker...
[18:22:33] <ottomata>	 that does seem to indicate that it would work
[18:22:55] <addshore>	 right, ill dig around in eventbus config and see if anything pops up
[18:23:07] <ottomata>	 k good luck, thanks
[18:29:48] <addshore>	 I even see EventBusHooks deferred updates firing `[DeferredUpdates] DeferredUpdates::run: ended MWCallableUpdate_MediaWiki\Extension\EventBus\EventBusHooks::sendRevisionCreateEvent #657`
[18:30:17] <ottomata>	 oh maybe eventbus consults stream config too!
[18:30:24] <addshore>	 `[EventStreamConfig] Stream 'mediawiki.page-move' does not match any `stream` in stream config`
[18:30:29] <addshore>	 `[EventBus] Using EventServiceDefault * for stream mediawiki.page-move. destination_event_service is not configured.`
[18:30:52] <addshore>	 *searches for stream config docs*
[18:31:26] <ottomata>	 hmm
[18:31:27] <ottomata>	     $wgEventServiceDefault = 'eventgate';
[18:31:36] <ottomata>	 i don't know how the *
[18:31:37] <ottomata>	 works
[18:31:41] <ottomata>	 in wgEventServices
[18:31:45] <ottomata>	 what if you set
[18:31:48] <ottomata>	   $wgEventServiceDefault = '*';
[18:31:49] <ottomata>	 ?
[18:31:56] <addshore>	 lemme give it a go
[18:32:17] <ottomata>	 from EventBus README
[18:32:19] <ottomata>	 Per stream configuration via EventStreamConfig is optional.  The default behavior is to
[18:32:19] <ottomata>	 produce all streams to the service specified by `$wgEventServiceDefault`.
[18:32:19] <ottomata>	 You must set `$wgEventServiceDefault` to the an entry in `$wgEventServices` to be
[18:32:19] <ottomata>	 used in case a stream's `destination_event_service` setting is not provided.
[18:33:05] <addshore>	 aaah, so i already have `$wgEventServiceDefault = '*';`
[18:33:09] <ottomata>	 hm
[18:33:27] <addshore>	 I dont get the `destination_event_service is not configured` bit?
[18:33:54] <addshore>	 oh right, so `destination_event_service is not configured` means that the default is used, and in theory i have that configured
[18:34:11] <ottomata>	 		// Use eventServiceDefault if no streamConfigs were provided.
[18:34:11] <ottomata>	 		if ( $this->streamConfigs === null ) {
[18:34:24] <ottomata>	 do you have anything set in wgEventStreams ?
[18:35:12] <addshore>	 no :/
[18:35:51] <ottomata>	 can you add some debug logging in EventBus/ServiceWiring.php and see what is happening for $streamConfigs?
[18:35:57] <addshore>	 yes!
[18:36:04] <ottomata>	 I dunno how you could get that message without wgEventStreams being set
[18:36:30] <joal>	 !log Rerun cassandra-daily-wf-local_group_default_T_unique_devices-2021-10-17
[18:36:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:36:37] <ottomata>	 from reading EventBusFactory::getInstanceForStream
[18:37:14] <addshore>	 https://usercontent.irccloud-cdn.com/file/cPAT7mYu/image.png
[18:38:05] <addshore>	 wait, so that means I do have some stream configs? but they are empty? or? cause that sure isn't null
[18:39:04] <addshore>	 I'm wondering why I see so much `[EventLogging] wgEventLoggingBaseUri has not been configured.` though
[18:39:17] <addshore>	 I guess for the new thing that shouldn't be needed, but maybe it is just noise
[18:39:56] <ottomata>	 i think that might just be noise
[18:40:05] <ottomata>	 but yes, that does seem like somehow it is an empty array!
[18:40:15] <ottomata>	 what if you set $wgEventStreams = null
[18:40:16] <ottomata>	 ?
[18:40:28] <ottomata>	 oh hmm
[18:40:41] <ottomata>	 hangon, with you shortly, doing a config deploy and testing some things...
[18:41:06] <addshore>	 `[6c3202b9ed478e8182b1be59] /w/index.php?title=User:Saffssaffsasss&action=submit Wikimedia\Assert\ParameterTypeException: Bad value for parameter EventStreams: must be a array`
[18:41:07] <addshore>	 np
[18:41:10] <addshore>	 i'll keep poking
[18:41:52] <ottomata>	 i think maybe eventbus has a bug
[18:42:40] <ottomata>	 or
[18:42:48] <ottomata>	 			$streamConfigs = $services->get( 'EventStreamConfig.StreamConfigs' );
[18:42:48] <ottomata>	  should be returning null
[18:42:55] <ottomata>	 if wgEventStreams is not defined
[18:43:07] <ottomata>	 OH
[18:43:08] <ottomata>	 		if ( ExtensionRegistry::getInstance()->isLoaded( 'EventStreamConfig' ) ) {
[18:43:08] <addshore>	 I tried setting $streamConfigs to null after those conditions and still no joy
[18:43:14] <ottomata>	 it's because EventStreamConfig is loaded
[18:43:26] * addshore tries turning that off
[18:43:55] <ottomata>	 you want 		if ( $this->streamConfigs === null ) {
[18:43:55] <ottomata>	  on line 165 of EventBusFactory to be true
[18:46:26] <addshore>	 right
[18:46:54] <addshore>	 i commented out a bunch of stuff, and then i see the event https://usercontent.irccloud-cdn.com/file/IEajIMYC/image.png
[18:48:56] <addshore>	 it's the very first condition there `if ( !$this->shouldSendEvent( $type ) ) {`
[18:51:13] <addshore>	 https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/master/extension.json#L19-L21
[18:52:05] <ottomata>	 oh ho
[18:52:05] <ottomata>	 hm
[18:52:19] <ottomata>	 i think that might be something that Petr wanted to remove?
[18:52:41] <ottomata>	 then, can you set that to TYPE_EVENT or TYPE_ALL
[18:52:42] <ottomata>	 ?
[18:53:27] <addshore>	 ya, setting to TYPE_ALL makes it all work
[18:53:28] <addshore>	 sweet
[18:53:29] <addshore>	 thanks!
[18:54:31] <ottomata>	 cool!
[18:54:50] <ottomata>	 i should have noticed that, i'm looking at mw vagrant setup and it has that
[18:54:57] <ottomata>	 https://www.irccloud.com/pastebin/oaKOp9x6/
[18:55:09] <addshore>	 once i merge this mwcli will have an out of the box eventlogging setup now :)
[18:55:20] <ottomata>	 cool
[18:55:36] <ottomata>	 it is hard to know where to draw the line, what producers should be enabled by default
[18:55:42] <addshore>	 yup
[18:55:45] <ottomata>	 do you want all instrumentation events?
[18:55:47] <ottomata>	 all mw state change events?
[18:56:05] <ottomata>	 all api log requests?
[19:10:04] <wikibugs>	 (CR) Ppchelko: "Looks OK to me, a couple of bike sheddy comments inlined. But I'm not The Expert in MCR, so I added Daniel as well. He's on vacation unfor" [schemas/event/primary] - https://gerrit.wikimedia.org/r/731006 (https://phabricator.wikimedia.org/T293195) (owner: DCausse)
[19:29:42] <joal>	 !log Rerun cassandra-daily-wf-local_group_default_T_top_pageviews-2021-10-17
[19:29:44] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:42:08] <ottomata>	 ebernhardson: yt?  i see you wrote WgConfTestCase 6 years ago in mediawiki-config
[19:42:27] <ottomata>	 i'm very confused about how StaticSiteConfiguration is supposed to work
[19:42:38] <ottomata>	 trying to do https://phabricator.wikimedia.org/T277193
[19:42:48] <ottomata>	 thought we had it all worked out
[19:42:51] <ottomata>	 but
[19:42:52] <ottomata>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731804
[19:42:57] <ottomata>	 causes https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/8614/console
[19:43:01] <ottomata>	 which is not what I expected
[19:43:13] <ottomata>	 and I can't quite test locally...
[19:43:16] <ottomata>	 maybe i need to figure that out
[20:16:20] <ottomata>	 oh doh, well i dunno how to test locally, but my diff was bad because i was dumb
[20:16:23] <ottomata>	 it works as expected!
[20:16:30] <ottomata>	 the -labs overrides don't though
[20:31:17] <wikibugs>	 Analytics, Analytics-Kanban, Product-Analytics, SRE, wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (nshahquinn-wmf) I think I can handle this just by using an absolute reference to the [file in refinery](https://github.co...
[20:32:50] <wikibugs>	 Analytics, Analytics-Kanban, Product-Analytics, SRE, wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (nshahquinn-wmf) p:High→Medium It seems like the priority isn't //that// high since there's a pretty easy workarou...
[20:33:23] <wikibugs>	 Analytics, Analytics-Kanban, Event-Platform, Metrics-Platform, and 2 others: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata) IT WORKS!  https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConf...
[20:43:12] <btullis>	 milimetric: > were you the one working on jaime's username / presto access problems?
[20:43:12] <btullis>	 No, I think that was Razzi who was working on it. Happy to take a look if you think it might help.
[20:43:40] <milimetric>	 ah, ok, np btullis, I pinged razzi on the task
[20:53:45] <wikibugs>	 Analytics, Product-Analytics, wmfdata-python: Upstream relevant parts of wmfdata-python into refinery - https://phabricator.wikimedia.org/T293700 (nshahquinn-wmf)
[20:55:18] <wikibugs>	 Analytics, Analytics-Kanban, Event-Platform, Metrics-Platform, and 2 others: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata) The merging in beta only sort of works.  'default' is not merged, so you can only override setti...
[20:56:20] <wikibugs>	 Analytics, Product-Analytics, wmfdata-python: Upstream relevant parts of wmfdata-python into refinery - https://phabricator.wikimedia.org/T293700 (nshahquinn-wmf)
[20:58:14] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) In summary:  * We have loaded 6 snapshots out of 12 * We have copied 1 of these remaining snapshots to an-presto100...
[21:01:28] <wikibugs>	 Analytics, Product-Analytics, wmfdata-python: Upstream relevant parts of wmfdata-python into refinery - https://phabricator.wikimedia.org/T293700 (Ottomata) I think we'd like to make the python stuff that refinery does now be able to use conda environments.  If we can do that, it would probably be be...
[21:05:41] <wikibugs>	 Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) ` btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-a/tmp/local_group_default_T_pageviews_pe...
[21:09:35] <wikibugs>	 (CR) Clare Ming: Add new scroll schema. (6 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[21:17:02] <wikibugs>	 (CR) Clare Ming: Add new scroll schema. (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731156 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming)
[22:09:18] <wikibugs>	 Analytics-Radar, Anti-Harassment, CheckUser, Privacy Engineering, and 3 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (Niharika)
[22:24:36] <wikibugs>	 Analytics, Patch-For-Review: Kerberos identity for kcv-wikimf - https://phabricator.wikimedia.org/T293189 (razzi) Open→Resolved Should be all set. Check your email kcvelaga-ctr@wikimedia.org for further instructions.
[22:25:11] <razzi>	 Ah, just seeing your earlier conversation milimetric 
[22:32:28] <wikibugs>	 Analytics, SRE-Access-Requests, Patch-For-Review: Kerberos identity for kcv-wikimf - https://phabricator.wikimedia.org/T293189 (Dzahn)
[22:37:48] <wikibugs>	 Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Upgrade Superset to 1.3.1 or higher - https://phabricator.wikimedia.org/T288115 (razzi) Open→In progress
[22:58:47] <wikibugs>	 Analytics-Radar, Anti-Harassment, CheckUser, Privacy Engineering, and 3 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (Niharika)
[22:59:30] <wikibugs>	 Analytics-Radar, Anti-Harassment, CheckUser, Privacy Engineering, and 3 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (Niharika) I have updated the task to reflect the latest timelines as published by the Google Chrome team.
[23:24:56] <wikibugs>	 (PS3) DLynch: talk_page_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076)