[04:04:59] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (tchin)
[04:05:01] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (tchin)
[07:43:20] PROBLEM - Check systemd state on analytics1077 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:20] PROBLEM - Host analytics1077 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:10] +icinga-wm │ PROBLEM - Host analytics1077 is DOWN: PING CRITICAL - Packet loss = 100% << Expected or dead?
[08:22:15] (Hiya)
[08:30:05] It's spitting stack traces
[08:33:22] claime: o/ sometimes it happens with hadoop workers, a powercycle fixes it
[08:33:31] ack
[08:33:39] feel free to go ahead, no real issue in doing it
[08:36:15] ok, power cycling
[08:39:24] RECOVERY - Host analytics1077 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[08:39:54] RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:00] Nice
[08:42:46] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=41 should recover as well
[09:07:55] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Create a shared flink docker image - https://phabricator.wikimedia.org/T316519 (JMeybohm) >>! In T316519#8444547, @Ottomata wrote: > - flink-kubernetes-operatore [[ https://nightlies.apache.org/flink/flink-kubernete...
[10:48:54] CUSTOM - Host an-coord1001 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[11:05:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:07:08] CUSTOM - Host an-coord1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[11:11:31] Data-Engineering, SRE, Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (BTullis) a:BTullis Thanks @Clement_Goubert - I believe that @odimitrijevic is already working on getting an updated licence. I'll claim this ticket and ta...
[11:13:43] Data-Engineering, SRE, Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (Clement_Goubert) FYI, I still reset the failed state on puppetmaster1001 so we get alerted if another service fails, and left a persistent comment with this ta...
[11:16:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:09] I noticed that the restart detection in debdeploy was failing for stat1004; after some debugging it turns out the /mnt/hdfs mount on stat1004 is broken, even running "ls /mnt/hdfs" stalls. known issue?
[11:49:53] moritzm: Thanks. I think you're right that it's broken. I'll look into it now. Normally for root it returns `ls: cannot access '/mnt/hdfs': Input/output error`
[11:52:35] milimetric: has a process open that was trying to read from it. If I recall he specializes in breaking the HDFS mount :-)
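For context on the debugging that follows: a hung FUSE mount like /mnt/hdfs can stall any process that merely stats the path, which is why ls, df, and lsof all hang later in this log. A minimal sketch of probing and clearing such a mount without blocking the shell, assuming root access and that the mount is declared in /etc/fstab (the exact remount procedure used on stat1004 is not shown in the log):

    # Probe the mount point with a timeout so a hung FUSE filesystem cannot stall this shell.
    if ! sudo timeout 5 stat /mnt/hdfs >/dev/null 2>&1; then
        echo "/mnt/hdfs appears hung or broken"
        # Lazy-unmount detaches it even while stuck processes still hold it open...
        sudo umount -l /mnt/hdfs
        # ...then remount it (assumes the mount is defined in /etc/fstab).
        sudo mount /mnt/hdfs
    fi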
[11:52:40] https://www.irccloud.com/pastebin/ZePCiBhD/
[11:53:51] !log attempting to unmount and remount `/mnt/hdfs` on stat1004
[11:53:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:54:08] looking at ps I can see hung lsof processes (which get emitted by debdeploy under the hood) dating back to Nov23
[11:55:36] which is a little strange, given that we pass "-e /mnt/hdfs" to lsof specifically to avoid parsing the HDFS data storage, since it's not relevant for detecting necessary restarts
[11:56:42] Yeah, that is odd.
[12:01:45] root 26305 0.0 0.0 6640 140 ? Ss Nov30 0:00 bash -c test -d /mnt/hdfs
[12:02:01] This server is not well lol
[12:02:16] lsof initially parses /proc/mounts, so it seems it hangs there before the "-e /mnt/hdfs" exclusion kicks in
[12:02:51] There are hung df, hung lsof all over the place
[12:03:00] Hung ls -l too
[12:04:20] Yeah, there was an oom-killer on Dec 02, the SATA controller is twerking every so often, but I think that the HDFS issue is probably just software. It's destined for the skip but I'll try to schedule a reboot.
[12:04:55] elukey: is fond of it like a pet. But it is cattle. :-)
[12:33:13] matomo1002 has its prometheus-mysqld-exporter stopped since 2022-09-26, is that on purpose?
[12:34:05] Nope. I'll look at that too. Thanks Clément.
[12:34:14] btullis: Thanks <3
[12:34:41] Also there's one more with my name on it... Hang on a sec.
[12:36:20] Oh stat1007 systemd. I'll look at that too.
[12:47:57] !log sudo systemctl restart wmf_auto_restart_prometheus-mysqld-exporter.service on matomo1002
[12:47:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:48:02] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:33] (ops week folks) steve_munene: btullis: there is one failed refine_event job alert, steve_munene, have you re-run these before? if not, want to do it together?
[13:38:37] I haven't, let's give it a go ottomata
[13:38:42] Damn, I missed this again.
[13:39:12] okay steve_munene do you see the Data-engineering-alerts email for 'Refine failures for job refine_event'
[13:39:12] ?
[13:39:47] we will be following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Refine#Rerunning_jobs
[13:39:56] Yes
[13:40:26] Data-Engineering-Planning, DC-Ops, ops-eqiad: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (BTullis)
[13:40:35] okay, so that email gives us most of the command we need to run, from an-launcher1002
[13:41:24] so log into an-launcher1002
[13:41:51] we need to do the kerberos-run-command thing, and run the correct refine job script. we can figure out which one from the subject of the email.
[13:41:55] in this case refine_event
[13:42:31] following
[13:42:35] so that will be: sudo -u analytics kerberos-run-command analytics refine_event
[13:42:43] plus the rest of the args to add from the email
[13:43:10] alright
[13:43:29] so the total command will be
[13:43:40] sudo -u analytics kerberos-run-command analytics refine_event --ignore_failure_flag=true --table_include_regex='mediawiki_talk_page_edit' --since='2022-12-05T02:00:00.000Z' --until='2022-12-06T03:00:00.000Z'
[13:43:50] go ahead and run that from an-launcher1002
[13:44:32] I have that
[13:44:40] it's running?
[13:45:10] refine_event is not running
[13:45:49] did you run that command?
[13:46:58] i think i saw it in ps for a sec, did it print out an application_id?
[13:47:05] yes got this towards the end
[13:47:06] 22/12/06 13:45:48 INFO Client: Application report for application_1663082229270_498300 (state: RUNNING)
[13:47:06] then
[13:47:06] 22/12/06 13:45:54 INFO ShutdownHookManager: Shutdown hook called
[13:47:14] great
[13:47:25] okay, now we just need to verify that it succeeded from the logs
[13:47:27] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Refine#logs
[13:47:37] i usually just grep those for 'Refine:' to see only the relevant messages
[13:48:04] sudo -u analytics kerberos-run-command yarn logs -applicationId application_1663082229270_498300 | grep Refine:
[13:48:07] run ^
[13:48:26] ah wait
[13:48:36] sudo -u analytics kerberos-run-command analytics yarn logs -applicationId application_1663082229270_498300 | grep Refine:
[13:48:37] that one ^
[13:49:50] if you run that, do you see the lines that say 'Finished refinement ' ... ?
[13:49:54] 22/12/06 13:45:53 INFO Refine: Successfully refined 2 of 2 dataset partitions into table `event`.`mediawiki_talk_page_edit` (total # refined records: 375)
[13:49:56] yes
[13:49:58] that is good too
[13:50:14] so now, to ack that you have rerun successfully, reply to the email
[13:50:47] I usually paste: 1. the command I ran to refine, 2. the application_id that was run, and 3. a line from the log indicating success (the one you pasted just now is fine)
[13:54:15] Thanks ottomata
[13:55:33] (PS18) Aqu: Add HdfsXMLFsImageConverter to refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168)
[13:56:09] (CR) Aqu: Add HdfsXMLFsImageConverter to refinery-job (12 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[13:56:23] Data-Engineering-Planning, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (BTullis) @Cmjohnson - I can shut down the machine at any time - or you can do it if it helps too. There's no depooling necessary, just dow...
[13:56:29] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Create a shared flink docker image - https://phabricator.wikimedia.org/T316519 (Ottomata) > I don't think the webhook has something to do with TLS Ah okay, I am super green here and don't have much experience writi...
[13:56:53] steve_munene: did you reply to the email?
[13:58:40] joal: Last version of HdfsXMLFsImageConverter for your review https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/852315/17..18 Thanks!
[13:58:40] ah, just got it
[13:58:46] perfect, thank you!
[13:58:48] Data-Engineering-Planning, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (BTullis) Set to failed in netbox. https://netbox.wikimedia.org/dcim/devices/3661/ {F35841349}
[13:59:11] steve_munene: just a FYI, there are 2 different levels of reporting log lines in that output
[13:59:24] the ones that start with Finished refining... are telling you the result of refining a specific hour of a specific dataset
[13:59:51] the last one that starts with Successfully refined... is telling you the result of refining all hours of a specific dataset
[14:00:04] Data-Engineering-Planning, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (BTullis)
[14:00:24] btullis: that fuse hdfs mount died a long time ago, and I think I killed it on stat1005 as well. I closed all my terminals and screens everywhere since then, so you can freely kill anything
[14:00:27] so, there were 2 datasets here that had 'Finished refining', and one of them, the 'codfw' dataset, had 0 records refined. this is normal, because there isn't any data from the codfw datacenter usually for many datasets
[14:01:04] *scratch that, stat1005 is fine
[14:01:28] milimetric: Many thanks. I'll reboot stat1004 now. 👍
[14:01:34] only mentioning this because you pasted the 'Finished refining ' log line only for the eqiad dataset. You can choose how much context you want to give in the ACK reply email. In this case, pasting both of the 'Finished refining' lines would have been fine. Sometimes, there are many many datasets to re-refine, and in that case the reply would be too verbose, so the ones starting with 'Successfully refined' will be sufficient
[14:01:45] anyway! thank you for re-refining! :)
[14:01:59] thanks for the notes ottomata
[14:08:29] (CR) Joal: [C: +1] "Thanks a lot for the changes @Aqu - for me this is ready" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[14:09:13] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (tchin) a:tchin
[14:32:06] (PS3) Snwachukwu: [WIP] Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769)
[14:32:59] (CR) Joal: "First round of comment - happy to discuss about how to reorganize this. Adding Marcel as well for his opinion." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769) (owner: Snwachukwu)
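The Refine re-run walked through above condenses to two commands on an-launcher1002. A summary sketch: the job name, table regex, and time window come from the 'Refine failures for job ...' alert email, and the application id is printed by the first command.

    # Re-run the failed Refine job as the analytics user (values below are from this alert).
    sudo -u analytics kerberos-run-command analytics refine_event \
        --ignore_failure_flag=true \
        --table_include_regex='mediawiki_talk_page_edit' \
        --since='2022-12-05T02:00:00.000Z' \
        --until='2022-12-06T03:00:00.000Z'

    # Verify the re-run from its YARN logs; 'Successfully refined' lines summarise each
    # dataset, while 'Finished refin...' lines report individual hours.
    sudo -u analytics kerberos-run-command analytics yarn logs \
        -applicationId application_1663082229270_498300 | grep 'Refine:'

The alert is then acked by replying to the email with the command, the application id, and a success line from the log, as described above.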
[15:05:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:25:07] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink application and flink-kubernetes-operator production docker images - https://phabricator.wikimedia.org/T316519 (Ottomata)
[15:27:59] Data-Engineering, Event-Platform Value Stream: Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (Ottomata)
[15:28:11] Data-Engineering, Event-Platform Value Stream: Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (Ottomata)
[15:29:00] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05): Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (Ottomata)
[15:32:41] (PS19) Aqu: Add HdfsXMLFsImageConverter to refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168)
[15:33:40] Data-Engineering, Event-Platform Value Stream: [EPIC] Flink Applications on Kubernetes - https://phabricator.wikimedia.org/T324578 (Ottomata)
[15:34:13] Data-Engineering, Event-Platform Value Stream: [EPIC] Flink Applications on Kubernetes - https://phabricator.wikimedia.org/T324578 (Ottomata)
[15:34:15] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05): Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (Ottomata)
[15:34:23] Data-Engineering-Planning, Event-Platform Value Stream, Shared-Data-Infrastructure: [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (Ottomata)
[15:34:27] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink application and flink-kubernetes-operator production docker images - https://phabricator.wikimedia.org/T316519 (Ottomata)
[15:34:38] Data-Engineering-Planning, Event-Platform Value Stream, Shared-Data-Infrastructure: [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (Ottomata)
[15:34:58] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink application and flink-kubernetes-operator production docker images - https://phabricator.wikimedia.org/T316519 (Ottomata)
[15:36:51] Data-Engineering, Event-Platform Value Stream: [EPIC] Flink Applications on Kubernetes - https://phabricator.wikimedia.org/T324578 (Ottomata)
[15:44:34] (PS20) Aqu: Add HdfsXMLFsImageConverter to refinery-job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168)
[15:51:52] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05): Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (JMeybohm) > This will mean adapting the upstream helm chart to fit in our deployment-charts repo with template conventions there. I would lean towa...
[15:53:14] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05): Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (Ottomata) OH! If that is possible/okay then that will be much much easier. Alright, I'll try that first and we'll see how that goes. CC @BTullis...
[15:56:04] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05): Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (JMeybohm) >>! In T324576#8447323, @Ottomata wrote: > OH! If that is possible/okay then that will be much much easier. Alright, I'll try that first...
[16:01:54] Data-Engineering, API Platform (Sprint 02), AQS2.0, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Pageviews: Implement Unit Tests - https://phabricator.wikimedia.org/T299735 (JArguello-WMF) Hi @codebug ! How is this one going? Any updates?
[16:02:54] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (BTullis) Oh, right. Yes that might have made it a bit easier, but I'm not 100% sure. The thing about the work I've done on the...
[16:04:35] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (Ottomata) Def will need specialized RBAC for Flink, but if we don't have to add all our common templates stuff, ideally we can...
[16:33:36] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (Ottomata) https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/helm/#watching-only-specific-...
[16:51:31] Do we have anything to be deployed by train today?
[16:51:38] (PS4) Snwachukwu: [WIP] Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769)
[16:52:16] (CR) Aqu: "Last review done. Thanks @Joal & @Mforns !" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/852315 (https://phabricator.wikimedia.org/T321168) (owner: Aqu)
[17:15:55] Data-Engineering, SRE, Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (BTullis) I checked with @odimitrijevic and she believes that it will take a few days to get the updated licence. She'd prefer that we do not disable the downlo...
[18:13:15] Data-Engineering-Planning, Product-Analytics, Wmfdata-Python: Remove Matplotlib as a Wmfdata-Python dependency - https://phabricator.wikimedia.org/T324053 (mpopov) p:Low→Triage Removing priority. To be set by @EChetty.
[19:05:21] Data-Engineering, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (Spaceliberty) @BTullis Good afternoon. I found this forum by accident and I need your help. I work in security and would like to log Presto requests in particular: the SQL query itself from the use...
[19:26:11] Data-Engineering-Planning, Product-Analytics: wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (odimitrijevic)
[19:28:21] Data-Engineering-Planning, Product-Analytics: wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (odimitrijevic) @kzimmerman let's discuss prioritizing. A significantly larger overcount may exist for the wikimedia project family.
[19:29:00] Data-Engineering-Planning, Product-Analytics: wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JAllemandou)
[19:47:24] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink Kubernetes Operator Helm chart - https://phabricator.wikimedia.org/T324576 (Ottomata)
[20:24:30] (PS6) Aqu: Declare the HDFS usage dataset in hive metastore [analytics/refinery] - https://gerrit.wikimedia.org/r/853303 (https://phabricator.wikimedia.org/T321169)
[20:31:50] (PS4) Aqu: Add script for HDFS XML fsimage to bin folder [analytics/refinery] - https://gerrit.wikimedia.org/r/850169 (https://phabricator.wikimedia.org/T321167)
[20:41:35] (CR) Tsevener: [C: +1] Add ios talk page interaction schema (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/857759 (https://phabricator.wikimedia.org/T321841) (owner: Mazevedo)
[20:48:28] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata)
[21:12:57] Data-Engineering, serviceops, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) flink-kubernetes-operator helm chart RBAC works in one of two modes: - cluster scoped - meaning the operator can manag...
[21:40:02] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (Cmjohnson)
[21:41:04] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (Cmjohnson) Open→Resolved @BTullis these servers are ready for you to image. BIOS/Network and firmware have been updated. I...
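As a rough illustration of the "watching only specific namespaces" mode referenced in T324576 above: per the upstream docs, the flink-kubernetes-operator chart exposes a watchNamespaces value, and the chart can then generate namespace-scoped RBAC instead of cluster-wide permissions. This is only a sketch of plain upstream helm usage (the namespace names and chart version are made-up examples); a WMF deployment would go through the deployment-charts/helmfile tooling instead.

    # Add the upstream chart repo (version shown is illustrative, not what was deployed).
    helm repo add flink-operator-repo \
        https://downloads.apache.org/flink/flink-kubernetes-operator-1.2.0/

    # Install the operator so it only watches the listed namespaces; with watchNamespaces
    # set, its RBAC can be scoped to those namespaces rather than the whole cluster.
    helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator \
        --namespace flink-operator --create-namespace \
        --set 'watchNamespaces={flink-app-example1,flink-app-example2}'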