[00:00:06] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal_daily on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:22:30] Data-Engineering, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: Problem details for HTTP APIs (rfc7807) - https://phabricator.wikimedia.org/T302536 (Milimetric) >>! In T302536#7833771, @Eevans wrote: > !!For the new service, we will include the following attributes:!!...
[00:47:51] Data-Engineering, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: Problem details for HTTP APIs (rfc7807) - https://phabricator.wikimedia.org/T302536 (Eevans) >>! In T302536#7834045, @Milimetric wrote: >>>! In T302536#7833771, @Eevans wrote: >> !!For the new service, we w...
[04:41:13] Analytics, Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation): Upgrade dbstore100* hosts to Bullseye - https://phabricator.wikimedia.org/T299481 (Marostegui) >>! In T299481#7833034, @razzi wrote: > > - since these databases were multiinstance, I'm not sure that the `sudo...
[04:42:45] Analytics, Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation): Upgrade dbstore100* hosts to Bullseye - https://phabricator.wikimedia.org/T299481 (Marostegui) Open→Resolved
[04:57:14] Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban): Upgrade clouddb* hosts to Bullseye - https://phabricator.wikimedia.org/T299480 (Marostegui) Any ETA on this?
[08:38:16] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad.
- https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[08:43:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[09:10:49] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:12:01] --^ I think that this is my fault. I restarted an-druid1001 as part of T304938 - I will rerun the job.
[09:13:07] ACKNOWLEDGEMENT - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly Btullis I restarted an-druid1001 as part of T304938 - which triggered this alert. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:17:58] Data-Engineering, Data-Engineering-Kanban: Upgrade Turnilo - https://phabricator.wikimedia.org/T301990 (ayounsi) Thanks for the task I was looking for the exact same thing. The scatterplot visualization would be useful for us in SRE as well.
[09:18:09] !log restarted eventlogging_to_druid_netflow_hourly on an-launcher1002
[09:18:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:30:48] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:05:00] (PS4) Snwachukwu: [WIP] Create a Hive to Graphite job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/775376 (https://phabricator.wikimedia.org/T304623)
[13:18:26] Data-Engineering-Radar, MW-1.39-notes (1.39.0-wmf.7; 2022-04-11): Decommission the UploadWizard* instruments - https://phabricator.wikimedia.org/T305238 (phuedx)
[13:23:27] (PS1) Phuedx: Remove UploadWizard* allowlist entries [analytics/refinery] - https://gerrit.wikimedia.org/r/777808 (https://phabricator.wikimedia.org/T305238)
[13:26:42] Data-Engineering-Radar, MW-1.39-notes (1.39.0-wmf.7; 2022-04-11), Patch-For-Review: Decommission the UploadWizard* instruments - https://phabricator.wikimedia.org/T305238 (phuedx)
[13:28:42] Data-Engineering-Radar, MW-1.39-notes (1.39.0-wmf.7; 2022-04-11), Patch-For-Review: Decommission the UploadWizard* instruments - https://phabricator.wikimedia.org/T305238 (phuedx) > Mark the legacy EventLogging schemas as inactive | Schema | | | --- | --- | | UploadWizardErrorFlowEvent | https://met...
[13:31:56] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (cmooney) @Cmjohnson no idea unfortunately, it should match the partman config so my guess is you are right, but I can't really confirm. Perha...
[14:19:44] I notice that we still have a mediawiki-history-drop-snapshot alert that has been active for a while.
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=an-launcher1002&service=Check+unit+status+of+mediawiki-history-drop-snapshot
[14:21:30] I think it's related to this ticket. https://phabricator.wikimedia.org/T303988
[14:24:54] (PS1) Tchanders: Add event_ipinfo_version to ipinfo_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777816 (https://phabricator.wikimedia.org/T296417)
[14:25:25] (CR) jerkins-bot: [V: -1] Add event_ipinfo_version to ipinfo_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777816 (https://phabricator.wikimedia.org/T296417) (owner: Tchanders)
[14:38:09] The last time this happened we just used a command like this: `sudo -u analytics kerberos-run-command analytics hdfs dfs -touchz /wmf/data/wmf/wikidata/item_page_link/snapshot=2022-02-28/_SUCCESS`
[14:38:26] ...for each directory that didn't have a _SUCCESS file, then restarted the job.
[14:38:40] Data-Engineering: Drop sanitized UploadWizard* data - https://phabricator.wikimedia.org/T305556 (phuedx)
[14:39:39] Data-Engineering-Radar, MW-1.39-notes (1.39.0-wmf.7; 2022-04-11), Patch-For-Review: Decommission the UploadWizard* instruments - https://phabricator.wikimedia.org/T305238 (phuedx)
[14:47:28] Data-Engineering, Equity-Landscape: Milestone: Input Data Models Complete. - https://phabricator.wikimedia.org/T305473 (ntsako) a:ntsako
[14:48:03] Data-Engineering, Equity-Landscape: Milestone: Transformation Definitions Complete: - https://phabricator.wikimedia.org/T305474 (ntsako) a:ntsako
[14:48:41] Data-Engineering, Equity-Landscape: Milestone: Ingest and Transform Input Data - https://phabricator.wikimedia.org/T305475 (ntsako) a:ntsako
[14:58:04] heya team (ottomata, joal, milimetric, aqu, and all!) I'm going to merge and deploy the Airflow refactor. This (hopefully not) can cause some Airflow alerts.
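(Editor's sketch, not part of the channel log.) The workaround described at 14:38 above, one `-touchz` per snapshot directory lacking a `_SUCCESS` file, could be prepared as a dry run along these lines. The function name `touchz_commands` and the idea of passing in a hand-collected list of directories are assumptions for illustration; the command prefix is the one quoted in the log, and nothing here is executed against HDFS.

```python
import shlex

# Command prefix quoted from the 14:38:09 message in the log.
PREFIX = ["sudo", "-u", "analytics", "kerberos-run-command", "analytics",
          "hdfs", "dfs", "-touchz"]


def touchz_commands(snapshot_dirs):
    """Build one `hdfs dfs -touchz` command line per snapshot directory.

    `snapshot_dirs` is a hypothetical, hand-collected list of HDFS
    directories that are missing a _SUCCESS file. The commands are only
    returned as strings; this function does not run anything.
    """
    commands = []
    for directory in snapshot_dirs:
        flag_path = directory.rstrip("/") + "/_SUCCESS"
        commands.append(" ".join(shlex.quote(part) for part in PREFIX + [flag_path]))
    return commands


if __name__ == "__main__":
    # The only path that appears in the log; any others would need checking by hand.
    for line in touchz_commands(["/wmf/data/wmf/wikidata/item_page_link/snapshot=2022-02-28"]):
        print(line)
```

Printing the commands first leaves room to review them before running each one and restarting the job, as the 14:38:26 message describes.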
[14:59:48] mforns: ack
[14:59:58] 🤞
[15:00:03] sorry btullis, forgot to ping you!
[15:00:05] hehehehe
[15:00:06] yea
[15:01:02] +1
[15:04:11] Without wanting to get ahead of ourselves, while you're thinking about airflow things it might be a good time to start experimenting with adding the datahub dependency package to airflow.
[15:04:12] https://datahubproject.io/docs/lineage/airflow/#setting-up-airflow-to-use-datahub-as-lineage-backend
[15:06:38] Data-Engineering, Airflow: [Airflow] Organize hackathon - https://phabricator.wikimedia.org/T295204 (JArguello-WMF)
[15:13:11] This is a pretty exciting result for me. It's only on staging and I haven't logged in yet, but definite progress.
[15:13:14] https://www.irccloud.com/pastebin/mOME7Jae/
[15:15:02] Data-Engineering: PySpark is unable to find Hive tables - https://phabricator.wikimedia.org/T305457 (bmansurov) Hi @JAllemandou! Thanks for the reply and fixing the tag. I have followed the instructions from the link you shared. I've created a new stacked environment and activated it before running. ` (know...
[15:23:22] !log deployed Airflow to analytics_test (big refactor)
[15:23:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:32:20] (PS7) Aqu: Add archiving job for Airflow [analytics/refinery/source] - https://gerrit.wikimedia.org/r/774383 (https://phabricator.wikimedia.org/T300039)
[15:35:56] btullis: woohooo!
[15:38:14] https://usercontent.irccloud-cdn.com/file/kxIDo1zU/image.png
[15:38:24] Getting there...
[15:40:53] Data-Engineering, Airflow, Epic, Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (JArguello-WMF)
[15:51:24] !log deployed airflow to analytics (big refactor)
[15:51:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:06:10] Data-Engineering: PySpark is unable to find Hive tables - https://phabricator.wikimedia.org/T305457 (JAllemandou) That's weird.
I don't see any change on stat1007 in SAL lately, so I'm out of ideas :( @Ottomata any suggestion?
[16:14:39] Data-Engineering: PySpark is unable to find Hive tables - https://phabricator.wikimedia.org/T305457 (Ottomata) Hiya, I think the issue is that you are using python directly to instantiate a SparkSession. You can do this, but it is the most manual way and will require more configuration than you are giving i...
[16:23:49] razzi: btullis was planning on skipping sre meeting today if that's ok, lemme know if you need me for anything
[16:24:55] That's no problem from my side. razzi - shall we all skip it, or would you rather meet?
[16:25:27] ottomata: btullis ok cool my main thing is trying to figure out the timing for rebooting hosts to pick up the new kernels
[16:26:09] for example an-launcher1002.eqiad.wmnet - do we have a standby launcher?
[16:27:26] razzi: Nope. No standby launcher at the moment.
[16:29:29] It's a good question though. How would we know which timers were omitted because of a reboot?
[16:30:27] which timers were omitted?
[16:30:35] btullis: you mean like which timers missed a run?
[16:30:54] I mean skipped? Yep, missed a run because the server was off for a few minutes.
[16:30:56] i think we don't usually worry about that. using timers that depend on specific runs never really works very well.
[16:31:23] Cool.
[16:31:37] also btullis I could use your help grokking the mediawiki-history-drop-snapshot.service failure :)
[16:32:17] OK, will be right with you. bc?
[16:36:33] Data-Engineering-Radar, MW-1.39-notes (1.39.0-wmf.7; 2022-04-11), Patch-For-Review: Decommission the UploadWizard* instruments - https://phabricator.wikimedia.org/T305238 (phuedx)
[16:48:52] (PS1) Phuedx: Remove ref to analytics/limn-multimedia-data repo [analytics/reportupdater] - https://gerrit.wikimedia.org/r/777838 (https://phabricator.wikimedia.org/T305565)
[16:49:46] (CR) jerkins-bot: [V: -1] Remove ref to analytics/limn-multimedia-data repo [analytics/reportupdater] - https://gerrit.wikimedia.org/r/777838 (https://phabricator.wikimedia.org/T305565) (owner: Phuedx)
[16:50:11] (PS2) Phuedx: Remove analytics/limn-multimedia-data repo reference [analytics/reportupdater] - https://gerrit.wikimedia.org/r/777838 (https://phabricator.wikimedia.org/T305565)
[16:50:50] (CR) jerkins-bot: [V: -1] Remove analytics/limn-multimedia-data repo reference [analytics/reportupdater] - https://gerrit.wikimedia.org/r/777838 (https://phabricator.wikimedia.org/T305565) (owner: Phuedx)
[17:03:18] (CR) Phuedx: "Recheck" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/777838 (https://phabricator.wikimedia.org/T305565) (owner: Phuedx)
[17:35:17] (PS1) Phuedx: mediawiki/client/metrics_event: Add mediawiki.db_name property [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777844 (https://phabricator.wikimedia.org/T304689)
[17:48:57] (PS3) Vivian Rook: Expose history of query revisions [analytics/quarry/web] - https://gerrit.wikimedia.org/r/773578 (https://phabricator.wikimedia.org/T100982)
[17:56:56] (CR) jerkins-bot: [V: -1] Expose history of query revisions [analytics/quarry/web] - https://gerrit.wikimedia.org/r/773578 (https://phabricator.wikimedia.org/T100982) (owner: Vivian Rook)
[18:09:14] (PS4) Vivian Rook: Expose history of query revisions [analytics/quarry/web] - https://gerrit.wikimedia.org/r/773578 (https://phabricator.wikimedia.org/T100982)
[18:19:28] (PS5) Vivian
Rook: Expose history of query revisions [analytics/quarry/web] - https://gerrit.wikimedia.org/r/773578 (https://phabricator.wikimedia.org/T100982)
[18:23:58] (CR) Vivian Rook: Expose history of query revisions (3 comments) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/773578 (https://phabricator.wikimedia.org/T100982) (owner: Vivian Rook)
[18:55:03] (CR) Joal: "Comments inline - happy to discuss them" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/772027 (https://phabricator.wikimedia.org/T300029) (owner: Aqu)
[19:06:29] Data-Engineering, Infrastructure-Foundations, Product-Analytics, Research, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (Volans) If I may add my use case too, I would like to be able to restrict the access to the webproxies from the cumin...
[19:54:54] Analytics-Radar, Data-Engineering, Discovery, Event-Platform: '.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string - https://phabricator.wikimedia.org/T286814 (sbassett)
[20:48:08] Data-Engineering: Error with refinery-drop-mediawiki-snapshots: table specs not matching partitions for wmf/wikidata/entity and wmf/wikidata/item_page_link - https://phabricator.wikimedia.org/T305591 (razzi)
[20:48:45] Data-Engineering: Error with refinery-drop-mediawiki-snapshots: table specs not matching partitions for wmf/wikidata/entity and wmf/wikidata/item_page_link - https://phabricator.wikimedia.org/T305591 (razzi) a:razzi
[20:53:46] !log roll restart aqs to deploy new mediawiki history snapshot
[20:53:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:55:58] (PS1) Jdrewniak: Fixing typo in desktopwebuiactionstracking schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777876 (https://phabricator.wikimedia.org/T301391)
[20:56:32] (CR) jerkins-bot: [V: -1] Fixing
typo in desktopwebuiactionstracking schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777876 (https://phabricator.wikimedia.org/T301391) (owner: Jdrewniak)
[21:21:09] (CR) Jdlrobson: [C: +2] Fixing typo in desktopwebuiactionstracking schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777876 (https://phabricator.wikimedia.org/T301391) (owner: Jdrewniak)
[21:21:39] (CR) jerkins-bot: [V: -1] Fixing typo in desktopwebuiactionstracking schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777876 (https://phabricator.wikimedia.org/T301391) (owner: Jdrewniak)
[21:21:46] (CR) Jdlrobson: [C: -1] Fixing typo in desktopwebuiactionstracking schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777876 (https://phabricator.wikimedia.org/T301391) (owner: Jdrewniak)
[21:22:38] (CR) Jdlrobson: [C: -1] "I don't think this needs a new version, as it was never used and that would be considered a breaking change?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777876 (https://phabricator.wikimedia.org/T301391) (owner: Jdrewniak)
[21:35:55] (PS1) Razzi: Upgrade to upstream version 1.35.0 [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/777881 (https://phabricator.wikimedia.org/T301990)
[21:46:12] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Upgrade Turnilo - https://phabricator.wikimedia.org/T301990 (razzi) I made a patch for this, but the scap deploy to staging failed due to some error with locales: ` Apr 06 21:42:49 an-tool1005 turnilo[26803]: Child process initialized in 2...
[21:50:14] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Upgrade Turnilo - https://phabricator.wikimedia.org/T301990 (razzi) Ah ok it appears we're now too far behind on nodejs versions ` Pre-requisites Node.js - 12.x or 14.x version ` https://github.com/allegro/turnilo So we'll have to up...