[00:28:48] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-bscarone-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:09] Analytics, API Platform: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (BPirkle) Thank you @JAllemandou , we may need to take you up on that offer. For now, can you check for any major flaws in my thinking? # I'm still pursuing the possibility of test...
[06:40:47] (CR) Joal: "One thing again, and ask for double check of results being the same for the new CSV format and the old way :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[08:25:27] Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) I'm going to try updating the RAID controller firmware, then the BIOS on stat1010, to see if either of these fixes the drive ordering issue....
[08:27:43] FYI, I'm going to reboot the host running Turnilo in 5m
[08:28:26] Ack. Many thanks moritzm. What's the reason, out of interest?
[08:28:39] for https://phabricator.wikimedia.org/T310483
[08:29:11] 👍
[08:56:28] Analytics-Radar, DBA: mariadb on dbstore hosts, and specifically dbstore1004, possible memory leaking - https://phabricator.wikimedia.org/T270112 (BTullis)
[08:56:31] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation): dbstore1007 is swapping heavily, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (BTullis) Open→Resolved a:BTullis Thanks @Ladsgroup for...
[08:59:58] Data-Engineering, Event-Platform, Generated Data Platform: [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 (gmodena)
[09:19:26] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation): dbstore1007 is swapping heavily, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (jcrespo) While searching for other things on MariaDB's JIRA, I saw...
[09:34:40] Data-Engineering, Product-Analytics, SDAW-MediaSearch, Structured-Data-Backlog (Current Work): [M] No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (mfossati) Open→In progress
[11:47:35] Data-Engineering, Data³: Audit JSON schemas for Gerrit events - https://phabricator.wikimedia.org/T311615 (hashar)
[11:49:10] I am about to attempt another failback of the namenode service from an-master1002 to an-master1001.
[11:50:15] No gobblin jobs are running and it is a relatively quiet time on the cluster.
[11:50:27] Happy to help
[11:50:36] btullis: I'll be monitoring with you if you wish
[11:50:58] Great, thanks. Yes please.
[11:51:58] !log btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
[11:52:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:52:28] https://www.irccloud.com/pastebin/hIPbnvrP/
[11:52:34] Looks OK so far.
[11:53:47] btullis: successful change (from zkfc log)
[11:54:13] Yep, looks that way to me too. The command returned successfully too.
[11:54:16] https://www.irccloud.com/pastebin/6tz7bVRQ/
[11:55:20] that's great :)
[11:55:49] now this doesn't explain why we failed previously, but at least that success makes me happy
[11:56:26] Yes. Phew.
It does give weight to the hypothesis that it is related to load on the cluster at the time of the failover.
[11:56:37] yes indeed
[11:56:38] Doesn't it?
[11:57:16] Now, I will also want to do a restart of the namenode on an-master1002 at some point, so that it picks up its increased heap size, but I'm not in any hurry to do that.
[11:57:42] hm, I'm less afraid of restarts than failover nowadays :)
[11:58:14] I'd let's give master1 some time, just to be sure, and if you wish you can restart namenode2 then
[11:58:30] And actually, the cluster was really quiet - almost no job :)
[11:58:32] Yes, I suggest that we also look at scheduling it for a period of low activity and try to measure the impact of a namenode restart.
[11:58:44] works for me
[11:59:01] 👍 Will do.
[11:59:12] I'm not sure I understand the notion of impact of namenode restart when it's passive though
[11:59:18] but that's ok :)
[12:04:58] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:05] Data-Engineering, Event-Platform, Generated Data Platform: [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 (Ottomata)
[12:09:48] Data-Engineering, Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (BTullis) We decided to attempt another failback from an-master1002 to an-master1001. This time we wanted to time it so that: * No gobblin jobs were running because of {T311263} * It was...
[12:17:05] Data-Engineering, Data³: Audit JSON schemas for Gerrit events - https://phabricator.wikimedia.org/T311615 (Ottomata) Very cool! So, Event Platform is opinionated about some of the fields in the schema. Some are required to automate validation and ingestion. See [[ https://wikitech.wikimedia.org/wiki/E...
[12:18:07] Data-Engineering, Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (Ottomata) NICE! That is good news. It will be interesting to test a failover during high cluster activity after we icebergify too!
[12:19:42] Data-Engineering, Data³: Audit JSON schemas for Gerrit events - https://phabricator.wikimedia.org/T311615 (hashar)
[12:24:26] heya team! yesterday I tried to deploy refinery and it failed, I will resume now and try again.
[12:25:22] Ack. Thanks mforns. Happy to help if there's any way I can.
[12:25:34] ok, thanks btullis! :]
[12:26:47] Data-Engineering, Data³: Audit JSON schemas for Gerrit events - https://phabricator.wikimedia.org/T311615 (hashar) Thanks for the links, I haven't looked yet how to format the payload or to send it to event gate. With this task, I am wondering whether the json schemas are valid for the event gate infras...
[12:31:46] Data-Engineering, Data³: Audit JSON schemas for Gerrit events - https://phabricator.wikimedia.org/T311615 (Ottomata) Hm, I am not sure if we have draft/2019-09 support in EventGate, but it would not be difficult to add if we don't. Since you will be generating the JSONSchemas yourself, I betcha there is a...
[12:34:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service,refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:37:29] btullis: the refinery scap deployment finished successfully :], however I noticed that the regular scap deploy (not the `-e thin` or `-e hadoop-test`) was super quick. Usually it's the slowest one. Could it be because most of the git-fat jar files were synced yesterday at my first try? And so today it's only done what was remaining?
[12:37:37] PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:38:32] mforns: its possible i think yes, it would mean that most of the hosts succeeded and had a copy of the deploy and it was just a symlink change/ roll forward?
[12:38:34] mforns: That sounds very plausible to me.
[12:39:22] ok btullis and ottomata thanks! will check that the code is present in an-launcher1002 and continue the deploy
[12:42:01] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:45:48] ottomata, btullis: I got an error when executing the refinery-deploy-to-hdfs script:
[12:45:55] https://www.irccloud.com/pastebin/0KJIm0jg/
[12:46:06] k looking
[12:46:15] seems related to git-fat process...
[12:46:53] yah git fat did not run here for sure.
[12:46:55] hmm
[12:47:05] going to try some deploys nthings
[12:52:15] (CR) Gmodena: WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[12:54:13] mforns: i did a -f redeploy on an-launcher1002 and it worked, am doing a full deploy -f now
[12:54:33] ok, thank you a lot
[12:54:45] okay proceed now
[12:54:48] scap deploy looks better i think
[12:55:26] ottomata: can I go ahead and execute refinery-deploy-to-hdfs in an-launcher?
[12:56:05] RECOVERY - Hadoop HDFS Namenode FSImage Age on an-master1002 is OK: FILE_AGE OK: /srv/hadoop/name/current/VERSION is 109 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[12:57:17] PROBLEM - Check unit status of refine_netflow on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:57:38] mforns yes proceed
[12:57:42] k
[13:00:23] RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:00:29] ottomata: it's working now
[13:00:33] gr8
[13:04:45] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:12:55] !log re-deployed refinery with scap and refinery-deploy-to-hdfs
[13:13:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:14:28] joal: I just deployed https://gerrit.wikimedia.org/r/c/analytics/refinery/+/807922/, I believe it will be picked up immediately by the Airflow job. Is that OK, or is there something else to do for that job?
[13:45:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:27] RECOVERY - Check unit status of refine_netflow on an-launcher1002 is OK: OK: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:59:31] Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) The RAID controller firmware update did not make any difference, but thankfully @fgiunchedi has identified the cause of the reversed device n...
[14:05:05] Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye executed with err...
[14:07:06] Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) I'm updating the BIOS as well from version 2.13.3 to version 2.14.2 since it was marked as urgent by Dell.
{F35286599,width=80%}
[14:12:54] Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye
[14:13:48] Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis)
[15:31:28] Data-Engineering, Data-Engineering-Kanban: Upgrade to latest PrestoDB and enable iceberg support - https://phabricator.wikimedia.org/T311525 (Ottomata) > I'm worried that the Presto Iceberg connector might not have kerberos support? Def the current problem. Joal created a test iceberg table via spark i...
[15:46:29] Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (RobH) >>! In T307399#8037215, @BTullis wrote: > The RAID controller firmware update did not make any difference, but thankfully @fgiunchedi has ident...
[15:53:52] Data-Engineering, Data-Engineering-Kanban: Upgrade to latest PrestoDB and enable iceberg support - https://phabricator.wikimedia.org/T311525 (Ottomata) It does look like Presto is just re-using its Hive and HDFS auth components in its Iceberg connector, so it should work. Investigating...
[15:57:42] I sent this to mforns but i think joal and milimetric are also going to like it: https://www.usenix.org/system/files/pepr22_slides_arvesen.pdf
[15:58:42] nuria_: Many thanks. That looks really interesting.
[15:58:59] Thanks nuria_ - will read :)
[16:06:57] mforns: Thanks for the deploy - I'll rerun the job and make sure everything is as expected :)
[16:14:19] Data-Engineering, Data-Engineering-Kanban, Airflow: [Airflow] Proof of concept of Cassandra loading - https://phabricator.wikimedia.org/T307935 (JAllemandou)
[16:14:51] Data-Engineering, Data-Engineering-Kanban, Airflow: [Airflow] Proof of concept of Cassandra loading - https://phabricator.wikimedia.org/T307935 (JAllemandou) This has been tested successfully - we'll need simple HQL jobs to load cassandra from now on :)
[16:19:26] Data-Engineering-Kanban, Data-Engineering-Radar, Generated Data Platform, Patch-For-Review: Flink output support for Event Platform events - https://phabricator.wikimedia.org/T310218 (Ottomata) a:Ottomata→dcausse David, assigning this to you! Am here to help but don't have a lot of time...
[16:51:34] Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) I tried the `partman/custom/kafka-jumbo.cfg` partman recipe on this how, but it didn't seem to be applied. When I checked the log I saw thi...
[17:27:16] !log killed mediawiki-history-load bundle in Hue, and started corresponding mediawiki_history_load DAG in Airflow
[17:27:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:57:35] joal: cool! I see the editors_daily_monthly DAG failed, is this related to the deploy?
[19:13:41] does a table exist in hive that contains the latest (wiki name, page_id, title) for wikis? I know i could try and pull it from wmf.mediawiki_history but there might be a simpler way
[19:51:54] hm, latest like up to date? i don't think so, you'd have to combine a mw snapshot, either wmf_raw.mediawiki_page or mediawiki history with event.mediawiki_page_create?
[19:51:59] ebernhardson: we sqoop snapshots of a set of mediawiki tables monthly into Hive, maybe wmf_raw.mediawiki_page helps?
[19:52:18] arg you beat me :]
[19:52:38] doesn't have to be up to date, i'm reading may 2022 data anyways, anything thats mostly complete is good enough. The mediawiki tables sound perfect, thanks!
[19:53:43] however! we are working on a streaming dataset that will have what you want: https://phabricator.wikimedia.org/T311129
[19:54:28] i suppose by latest i mostly meant i don't have to run an aggregation over the table and figure out which row is the current value
[19:54:57] wmf_raw.mediawiki_page will work, but its monthly snapshots
[20:11:48] Data-Engineering: Public Druid cluster leftovers from old mw history snapshots - https://phabricator.wikimedia.org/T311547 (Milimetric) I checked the [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Delete_segments | Coordinator UI ]] and didn't see `mediawiki_history_reduced_2022_02`, only `.....
[20:23:22] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Upgrade to latest PrestoDB and enable iceberg support - https://phabricator.wikimedia.org/T311525 (Ottomata) GOT IT! @JAllemandou try it out on test cluster.
[20:27:03] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster
[20:37:37] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec...
[20:41:41] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS buster
[20:48:36] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS buster exec...
[20:49:26] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye
[21:08:28] (PS1) Milimetric: Add world map echart [analytics/dashiki] - https://gerrit.wikimedia.org/r/809683
[21:36:03] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex...
[22:09:40] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (Cmjohnson) @BTullis Robh figured out the workaround to get the right raid volume to boot first. I tried on an-presto1006 and everything seeme...
[23:07:59] PROBLEM - Check unit status of analytics-dumps-fetch-pageview on clouddumps1002 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:08:21] PROBLEM - Check unit status of analytics-dumps-fetch-unique_devices on clouddumps1002 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:13:14] PROBLEM - Check unit status of analytics-dumps-fetch-clickstream on clouddumps1002 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-clickstream https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:13:41] PROBLEM - Check unit status of analytics-dumps-fetch-mediacounts on clouddumps1001 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:16:21] PROBLEM - Check unit status of analytics-dumps-fetch-clickstream on clouddumps1001 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-clickstream https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:16:29] PROBLEM - Check unit status of analytics-dumps-fetch-mediacounts on clouddumps1002 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:18:47] PROBLEM - Check unit status of analytics-dumps-fetch-unique_devices on clouddumps1001 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:43:49] Analytics, API Platform: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (BPirkle) Worked a bit more on this today.
I was able to execute a canned example druid query via the instructions [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid |...