[00:26:08] PROBLEM - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:27:18] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:13:39] Analytics, Data-Engineering: Update geocode UDF to NOT lookup some addresses - https://phabricator.wikimedia.org/T271340 (JAllemandou)
[08:14:07] Analytics, Data-Engineering: Update geocode UDF to NOT lookup some addresses - https://phabricator.wikimedia.org/T271340 (JAllemandou) Description updated with details. This could very well be a startup task.
[09:11:16] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (JAllemandou) +1 to try both!
[10:00:14] I am going to be scheduling a rolling restart of the aqs_next cassandra services this morning, in support of https://phabricator.wikimedia.org/T297460
[10:00:14] Although I'm going to leave one instance *not-restarted* in order that we can try to observe the memory pressure observed in https://phabricator.wikimedia.org/T298516 and capture a heap dump.
[10:00:44] ack btullis :) thank you
[10:32:20] https://usercontent.irccloud-cdn.com/file/8X9Q4E0X/image.png
[11:39:55] Analytics, Data-Engineering, Event-Platform, Observability-Alerting: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (BTullis) a:BTullis→None I don't have much more time to investigate this at the moment, so I'll unassigni...
[11:46:46] hi folks, qq - do you mind if I downtime + upgrade an-coord1002 to check the hive packages? Just to make sure that the error reproduces in there too
[11:47:30] elukey: I have no objection. Feel free to go ahead.
[11:48:21] thanks!
[11:51:42] I can reproduce, rolling back
[11:52:42] Thanks elukey. That's both good and bad news, I suppose :-)
[11:56:18] Data-Engineering, Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (BTullis) Open→Resolved
[11:56:25] Analytics-Radar, WMDE-Templates-FocusArea, MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), Patch-For-Review, WMDE-TechWish (Sprint-2021-02-03): Adjust edit count bucketing for TemplateData - https://phabricator.wikimedia.org/T272569 (Lena_WMDE)
[11:56:42] btullis: there is definitely something that I am missing, I cannot repro on bigtop's test environment but I tried all the different packages/settings/etc..
[11:58:05] Did you build packages from the bigtop-1.5 branch too? I'm still wondering if it's something that I did wrongly when building these packages.
[11:59:51] yes yes those should be fine
[12:00:12] from other branches you'd have got a different version
[12:03:08] Analytics-Radar, WMDE-Templates-FocusArea, Patch-For-Review, WMDE-TechWish (Sprint-2021-02-03): Adjust edit count bucketing for TemplateWizard, segment all metrics - https://phabricator.wikimedia.org/T273475 (Lena_WMDE)
[12:03:40] It is very weird, isn't it? Not sure what the next steps should be.
[12:05:35] I contacted bigtop and hive upstreams to ask for advice, in the meantime I'll try to reproduce
[12:43:21] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) I propose to take a heap dump from aqs1014-b which has been running for 3 weeks and 2 days. I...
[13:28:30] going to retry to upgrade an-test-coord1001 folks, errors are mine in case, I'll try to stop timers :)
[13:40:42] restoring
[14:02:51] anyone got any idea where the code is that takes the webrequest logs and produces the statsv topic?
[14:03:08] I'm expecting that over the weekend, we might start seeing AQS/Cassandra related errors on aqs1014-b, as this instance gets into repeated garbage collection loops. Please don't restart the instance if this happens, because I'd like to try to capture a heap dump before doing so.
[14:03:29] ack btullis
[14:03:43] btullis: would it be worth taking a snapshot now, just in case?
[14:04:15] addshore: I have no clue - I think the performance team manages that
[14:04:31] addshore: I'm not yet familiar with this component, but could it be this? https://gerrit.wikimedia.org/g/analytics/statsv
[14:05:30] joal: Yes I think that's a good idea.
[14:06:30] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) I will take a heap dump now anyway, in case I can't catch it at a later state. It might be u...
[14:11:28] Data-Engineering-Kanban, SRE, SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis)
[14:21:49] Hi ottomata - I'd like your opinion on the actions to pursue regarding the refine errors due to schema error - Are we gonna change the table manually and re-refine, or should we refine with drop-malformed?
[14:25:03] Data-Engineering, Data-Engineering-Kanban: Send cassandra3 (new hosts) logs to logstash - https://phabricator.wikimedia.org/T297460 (BTullis) I have restarted all instances except aqs1014-b (which is being kept running in support of T298516 - When it is restarted it will pick up the new logging configura...
[14:34:08] Data-Engineering, Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (BTullis) Hello. Sincere apologies for the delay in fixing this. I had hoped to have tested upgraded Log4 versions under Hive to see if it fixes this, but I've run int...
[14:38:33] Data-Engineering-Kanban, SRE, SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis) I have created a Kerberos principal for Antoine. ` btullis@krb1001:~$ sudo manage_principals.py get aqu get_principal: P...
[14:52:41] !log root@aqs1014:~# jmap -dump:live,format=b,file=/srv/cassandra-b/tmp/aqs1014-b-dump202201071450.hprof 4468
[14:52:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:53:01] Creating a heap dump now from aqs1014-b
[14:55:45] Oh.
`Unable to open socket file: target process not responding or HotSpot VM not loaded`
[14:56:44] Analytics-Kanban: Adding aquhen@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T298778 (Antoine_Quhen)
[15:02:30] Better: `Dumping heap to /srv/cassandra-b/tmp/aqs1014-b-dump202201071450.hprof ...`
[15:10:10] joal: https://phabricator.wikimedia.org/T298721
[15:10:14] it's kind of up to them
[15:12:00] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) I have created the heap dump file. I had to chown to the cassandra user, as even root couldn...
[15:16:41] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) Compressing the dump file: ` root@aqs1014:/srv/cassandra-b/tmp# tar cjvf aqs1014-b-dump20220...
[15:30:03] Data-Engineering-Kanban, SRE, SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis) I believe that this is now complete, but feel free to respond on this ticket Antoine if anything doesn't behave as you'd...
[15:33:20] Analytics-Radar, Revision-Slider, WMDE-Analytics-Engineering, Patch-For-Review: Data need: Explore range of article revision comparisons - https://phabricator.wikimedia.org/T134861 (thiemowmde)
[15:51:00] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) I have placed the resulting 2.6 GB bzip2 file at `aqs1014.eqiad.wmnet:/home/btullis/aqs1014-...
[16:14:16] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (Eevans) >>! In T298516#7605339, @BTullis wrote: > I have placed the resulting 2.6 GB bzip2 file at `a...
[16:21:05] Data-Engineering-Kanban, SRE, SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (BTullis) @Antoine_Quhen - I notice that you haven't added yourself to the `analytics-admins` group in `data.yaml`, only the `anal...
[16:34:00] btullis, joal - I think that I found the issue with hive - /usr/share/java/apache-log4j-extras.jar is installed via liblog4j-extras1.2-java and it ends up in the hive metastore's classpath
[16:34:07] with it, I can reproduce the weird error
[16:35:06] I installed it at the time (sigh - https://phabricator.wikimedia.org/T276906) to support the rolling file appender
[16:35:48] elukey: Fantastic! Great investigative work.
[16:36:56] Ah, just a couple of months before I started :-)
[16:39:31] :)
[16:39:53] it works fine for all hadoop daemons, but Hive loads it as well in its classpath
[16:40:02] and with the new changes, it doesn't play well probably
[16:40:35] but I am not sure how to exclude it from Hive's classpath at this point
[16:43:30] we could add a flag to hadoop-env.sh, to selectively add the jar if needed
[16:43:36] on the coordinators etc.. we don't
[16:46:20] Data-Engineering-Kanban, SRE, SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (odimitrijevic) Approved
[16:46:50] yep I confirm that the package works if I comment the line on hadoop-env.sh
[16:46:56] \o/
[16:47:04] Yeah, I see what you mean. Do we know *specifically* what it is within that jar that Hive doesn't like?
I wonder if another option is slimming down that jar file.
[16:47:15] Excellent!
[16:49:01] > on the coordinators etc.. we don't
[16:49:01] So the rolling file appender was only added so that we can keep more logs on each of the worker nodes' local disks, without using much more space. Correct? Or was it important for masters as well?
[16:53:10] btullis: the jar is brought in by Debian upstream, and it adds some appenders to log4j, that we use on all hadoop daemons to rotate+gzip logs (masters and workers)
[16:53:26] it is added to the Hadoop classpath, that hive uses
[16:53:49] Hive now uses a more recent log4j 2.x version, that clashes with the extra appenders etc.. for sure
[16:54:07] my idea is to have a flag that turns the setting on/off via hiera
[16:58:00] elukey: OK, got it. I didn't know that it was from Debian upstream and I didn't know that it was required for everything *except* Hive. What about oozie? That also runs on the coordinators. Does that need access to the same `hadoop-env.sh` file?
[16:58:32] it works without it afaics, it doesn't use the log appender
[17:02:18] OK, that sounds fine to me then? Shall I make the CR?
[17:02:59] already created one, currently testing it, I'll publish it (hopefully) in a couple of min
[17:03:17] I am interested to see the client side, if it shows the same issue (I don't think so but better test it :)
[17:04:14] I certainly managed to run the hive cli with the upgraded package, but it was only at that point I noticed that the metastore hadn't started.
[17:05:08] I'd like to make sure that the hive client running on say an-test-client1001 (we'll have to upgrade hive packages in there too in theory) doesn't cause issues with the hadoop extra log4j flag enabled
[17:05:18] I think it is only a datanucleus-related thing for server/meta
[17:05:49] but one thing that I learned about hadoop is that hope is always bad :D
[17:10:11] btullis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/752171 :)
[17:13:34] btullis: thanks, merging :)
[17:19:52] btullis: ok all done! hive metastore in hadoop test up and running :)
[17:20:25] (going afk, will check later)
[17:20:25] Fab! Thanks again. We'll do production on Monday?
[17:20:51] btullis: let's roll out, if you are ok, the -3 packages in all hadoop test nodes, and leave it running for one day, just to be sure
[17:21:12] we could in theory just upgrade the coordinator, if we wanted
[17:21:23] but any new reimaged node would get the new packages
[17:21:45] Kudos elukey - awesome troubleshooting!
[17:22:04] <3
[17:24:57] OK. I didn't realise that hive packages were also installed on workers, but yep I'll upgrade those too and leave it over the weekend.
[17:26:40] and an-test-ui1001.
[17:29:59] !log deployed updated hive packages to an-test-worker100[1-3] and an-test-ui1001
[17:30:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:08:43] Data-Engineering, Data-Engineering-Kanban: Create Analytics Network Diagram & Documentation - https://phabricator.wikimedia.org/T298577 (BTullis) This is a representation of the four rows at eqiad, currently showing production hadoop worker and master nodes. It's still from an early draft, but this physi...
[18:08:50] I'm off for now folks. Have great weekends.
[18:09:00] laters!
[20:00:38] Data-Engineering, Generated Data Platform, Platform Engineering, SRE: Import Debian package of Cassandra 3.11.11 as 'dev' version - https://phabricator.wikimedia.org/T298805 (Eevans)
[20:14:59] Data-Engineering-Kanban, Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (Ottomata) Update: https://gitlab.wikimedia.org/otto/workflow_utils/ - Have been improving and experimenting with `conda-dist` built envs. - I can successfully build conda dist envs wit...
[20:16:39] !log altering hive table MobileWikiAppiOSUserHistory field event.device_level_enabled to string - T298721
[20:16:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:16:43] T298721: MobileWikiAppiOSUserHistory sending incompatible data - https://phabricator.wikimedia.org/T298721
[20:19:38] Data-Engineering, Product-Analytics, Wikipedia-iOS-App-Backlog, iOS-app-feature-Analytics: MobileWikiAppiOSUserHistory sending incompatible data - https://phabricator.wikimedia.org/T298721 (Ottomata) > Manually alter the Hive table event.device_level_enabled field to a string. This will likely ca...
[20:50:04] Hi ottomata you there? I have a question about verifying the ssh key for Sandra's access request https://phabricator.wikimedia.org/maniphest/task/edit/298786/
[21:00:31] Data-Engineering-Kanban, Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (Ottomata) Example of launching a custom spark version from a packed conda env using skein: `lang=python myproject_spark_yarn_cluster = skein.Master( resources=skein.Resources(memor...
[21:08:08] razzi: hio
[21:08:28] wassssup?
[21:08:50] Hey hey, the question is: how to validate the ssh key in the access request? Does it need to be uploaded somewhere else?
[21:10:09] It says in this footnote on the access request document https://wikitech.wikimedia.org/wiki/SRE/Production_access#cite_note-2
[21:10:09] > You can also put your public key on your wiki user page, in a Phabricator paste, or in a Gerrit patchset you upload, but you can't include it in an email reply to the task.
[21:10:35] But isn't putting it in a phabricator paste no more validation than the access request ticket itself?
[21:13:26] hmm good q
[21:13:41] razzi: i'm guessing that would be needed if the user didn't create the task and put their ssh key
[21:13:57] like, if a manager created the task and the user needed to put the ssh key somewhere
[21:14:04] although, i'm not totally sure
[21:14:07] ok yeah
[21:14:18] let's ask daniel zahn!
[21:14:29] in #sre ?
[21:15:23] ya
[22:37:38] Data-Engineering-Kanban, Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (Ottomata) End of day update: I have been successful in automating the generation of different types of conda dist envs that use spark: - Custom python & pyspark versions - Custom pyth...