[09:23:14] Data-Engineering, Product-Analytics: Request for SQL Templating to be enabled in Superset - https://phabricator.wikimedia.org/T312134 (BTullis) Open→Resolved p:Triage→Medium Great! Glad it works as expected.
[13:23:07] Data-Engineering-Kanban, Data-Catalog, Data Engineering Planning (Sprint 01), Patch-For-Review: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (BTullis) I think that there is an easier way to test this patch, given that we don't need the whole toolchain. We just need...
[13:31:25] btullis: nice idea on testing the script, for some reason I thought it was compiling some kind of jar executable thing but of course it's python :P. Trying a stateful ingestion would be cool, and would further test the patch in edge cases, so I say go for it. If it doesn't work either way we can always delete and re-ingest
[13:34:29] OK, cool, thanks. I'm wondering if stateful runs need to be enabled from the first run, so it might not work if enabled on a subsequent run. Anyway, I will try and see how it goes.
[13:39:26] I think probably from the first run, I was wondering the same yeah
[13:41:37] Hmm. Now, why would I be getting this?
[13:41:42] https://www.irccloud.com/pastebin/GjhdHC2M/
[13:42:06] I've done one of these: `pip install apache-superset==1.4.2` and it was happy.
[13:45:17] Analytics, Data-Engineering, Pageviews-API, Pageviews-Anomaly: "Venuše (planeta)" on cs.wp has surprisingly high numbers in Pageviews Analysis (and also Topviews Analysis) - https://phabricator.wikimedia.org/T239532 (Urbanecm)
[13:48:01] (PS2) Milimetric: Add world map echart [analytics/dashiki] - https://gerrit.wikimedia.org/r/809683
[13:49:50] btullis: I had that too... I installed all the other requirements first with conda (that libffi thing? whatever it was I asked you about) and then I `pip uninstall apache-superset` and reinstalled it and it was fine
[13:49:55] but it wasn't finding superset for me either
[13:51:36] Thanks. Good to know. Trying that now.
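(A minimal sketch of the reinstall sequence described above, assuming a conda environment like those on the stat hosts; the libffi dependency and pinned version come from the conversation, everything else is illustrative:)

```sh
# Hypothetical recovery when pip reports apache-superset as installed but the
# module still isn't importable: satisfy the native dependencies from conda
# first, then cleanly reinstall the wheel with pip.
conda install -y libffi                # the dependency mentioned above
pip uninstall -y apache-superset
pip install apache-superset==1.4.2     # version pinned earlier in the conversation
superset version                       # sanity check that the package now resolves
```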
[13:58:03] Data-Engineering, Research-Backlog: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (Urbanecm)
[14:14:59] Data-Engineering, Pageviews-Anomaly: Analyze possible bot traffic for frwiki article Cookie (informatique) - https://phabricator.wikimedia.org/T313114 (Urbanecm) Thanks for the report! I checked data that are available in [Turnilo](https://turnilo.wikimedia.org) about this article. Looks to be public clo...
[14:18:32] Data-Engineering-Kanban, Data-Catalog, Data Engineering Planning (Sprint 01), Patch-For-Review: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (BTullis) I have tried a stateful ingestion but it fails validation: `btullis@stat1008:~/src/datahub/ingestion$ datahub inge...
[14:21:26] Data-Engineering-Kanban, Data-Catalog, Data Engineering Planning (Sprint 01), Patch-For-Review: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (BTullis) Deleted the entities associated with superset. `(2022-04-14T15.32.43_btullis) btullis@stat1008:~/src/datahub/inges...
[14:22:00] ottomata: hello! :] did you restart the eventlogging processor yesterday when you merged the patch? just checking to know if I can submit the extension patch
[14:24:20] milimetric: Nice looking error for you :-) https://phabricator.wikimedia.org/P31145
[14:27:52] Do you want to pair on this or anything? I've got superset running and I can just leave it on port 8088 if you want to try running the ingester against it yourself.
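(A sketch of the kind of recipe being tested here, assuming DataHub's standard YAML recipe layout and the stock Superset source; the stateful attempt above failed validation, so the exact keys the source accepts may differ, and the pipeline name, credentials, and sink address are placeholders rather than the values used on stat1008:)

```sh
# Hypothetical recipe pointing the DataHub Superset source at the instance
# left running on port 8088, with stateful ingestion enabled from the start.
cat > superset-recipe.yml <<'EOF'
pipeline_name: superset_test_ingestion   # stateful ingestion keys its state to this name
source:
  type: superset
  config:
    connect_uri: http://localhost:8088   # the Superset instance mentioned above
    username: admin                      # placeholder credentials
    password: admin
    stateful_ingestion:
      enabled: true                      # lets later runs drop stale entities
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080        # placeholder GMS endpoint
EOF
datahub ingest -c superset-recipe.yml
```

If stateful runs really do need to be enabled from the first run, as speculated above, deleting the superset entities first (as was done here) and re-ingesting from scratch would be the safest starting point.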
[14:28:11] um... if you want we can pair on this, I think I know what it is
[14:28:54] Sure thing. Batcave?
[14:28:56] omw
[14:34:15] (PS2) Milimetric: Add base_uri config parameter to Superset source [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/812023 (https://phabricator.wikimedia.org/T306903)
[14:50:09] Data-Engineering-Kanban, Data-Catalog, Data Engineering Planning (Sprint 01), Patch-For-Review: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (BTullis) Looking good. After one small modification to the patch (https://gerrit.wikimedia.org/r/c/analytics/datahub/+/81202...
[14:52:26] (CR) Btullis: "This works, but we're going to upstream it instead of merging this to our fork." [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/812023 (https://phabricator.wikimedia.org/T306903) (owner: Milimetric)
[15:04:23] Data-Engineering-Kanban, Data-Catalog, Data Engineering Planning (Sprint 01), Patch-For-Review: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (Milimetric) Pull request sent upstream with https://github.com/datahub-project/datahub/pull/5408
[15:13:03] (CR) Ottomata: Schemas for Gerrit (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[15:16:55] (CR) Ottomata: [C: +1] "I'm going to be out for 2 weeks, one for offsite, one for vaca! +1 in general! I think you should be able to merge yourself if you are re" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[15:18:36] heya btullis :] can you please restart the eventlogging processors on eventlog1003? Andrew merged the corresponding change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/813925) yesterday, but I'm not sure if he restarted them, I didn't ask him. I believe they need to be restarted before my next patch is deployed in mediawiki
[15:19:03] the command specified in the docs is: sudo service eventlogging-processor@client-side-* restart
[15:19:12] mforns, doing it now.
[15:19:49] thank you!
[15:21:02] (CR) Hashar: Schemas for Gerrit (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[15:21:42] (CR) Joal: "I have reviewed the first 3 files, and the comments I have for them repeat for all the other files. Nothing major, mostly comments about " [analytics/refinery] - https://gerrit.wikimedia.org/r/812095 (https://phabricator.wikimedia.org/T311507) (owner: NOkafor)
[15:22:28] mforns: They were restarted 19 hours ago. They are running but there are lots of warnings and errors. You can run this without root if it helps: `systemctl status --no-pager eventlogging-processor@client-side-*`
[15:22:51] Would you still like me to restart them?
[15:23:00] oh! thanks, hm 19 hours ago...
[15:24:51] Andrew merged the patch yesterday 9:50 pm my time, now it's 17:24, which means... yes, it was probably him
[15:25:26] can you point me to the warnings and errors please?
[15:25:32] btullis ^
[15:26:36] Confirmed it was otto
[15:26:43] thanks!
[15:26:45] https://www.irccloud.com/pastebin/vCTK5wky/
[15:28:26] The command `systemctl status --no-pager eventlogging-processor@client-side-*` shows about 10 lines from each unit,
[15:28:44] Are you able to run this? `journalctl -u eventlogging-processor@client-side-00.service`
[15:28:47] oh, ok! looking
[15:30:28] I'm not able to see the logs, on other machines I'm allowed to use sudo for journalctl, but not on this one
[15:35:36] https://usercontent.irccloud-cdn.com/file/lAdDpxCR/eventlogging-logs-20220715.txt.gz
[15:37:32] I uploaded a file, but then deleted it just in case there was any privacy concern. I'll just put it in your home on eventlogging1003
[15:40:12] ok, btullis thanks a lot
[15:45:50] btullis: I see now. These errors are validation errors of the different schemas that were configured for eventlogging, and a certain percentage of errors is expected. They have nothing to do with yesterday's changes. I think they are fine.
[15:48:07] Cool. Thought they might be, but it's disconcerting to see them when looking at the systemctl status. :-) I guess these validation errors are all captured by logstash with the newer eventgate ones.
[15:48:27] aha
[16:14:49] Data-Engineering: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (BTullis) Open→Resolved I'm tempted to resolve this problem for now, given that we know there is a workaround and we know that we should try to avoid failing over during the busiest times for the cluster.
[16:27:06] Analytics, Voice & Tone: Rename geoeditors_blacklist_country - https://phabricator.wikimedia.org/T259804 (Isaac) +1 can we get a reason for why this is being declined? It would seem doable even if it's low priority though I have no idea of the complexity so could easily be missing something. More genera...
[16:36:27] Data-Engineering-Kanban, Data-Catalog, Data Engineering Planning (Sprint 01), Patch-For-Review: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (BTullis) Open→Resolved
[16:36:30] Data-Engineering, Data-Catalog, Epic: Data Catalog MVP - https://phabricator.wikimedia.org/T299910 (BTullis)
[16:40:20] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:40:41] btullis: really, is that 3? ^
[16:42:15] Same batch
[16:42:26] Data-Engineering: RAID battery alert in an-worker1093 - https://phabricator.wikimedia.org/T313130 (BTullis)
[16:42:52] ACKNOWLEDGEMENT - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Investigating - T313130 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
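(For reference, a sketch of the MegaCli checks behind this alert, per the wikitech page linked above; depending on the host the binary may be installed as megacli or MegaCli64:)

```sh
# Show the write cache policy of each logical drive; the alert above fires
# when drives fall back from WriteBack to WriteThrough.
sudo megacli -LDInfo -Lall -aALL | grep -i 'Cache Policy'

# A WriteThrough fallback usually means the controller is protecting itself
# from a failed or expired BBU (the battery discussed below), so check its
# state as well.
sudo megacli -AdpBbuCmd -GetBbuStatus -aALL | grep -iE 'Battery State|Charge'
```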
[16:43:21] btullis: is there something wrong with that batch
[16:43:48] That's 3 in 2 months
[16:44:26] The last one was an-worker1082 and this one is an-worker1093 so I'm not sure whether it's the same batch. Annoying though. I've scheduled a month's downtime and acked it.
btullis: it is
[16:46:44] 78-95 were installed together
[16:47:00] Would be T204177 for you
[16:47:09] RhinosF1: OK, good to know, thanks.
[16:47:35] https://phabricator.wikimedia.org/T207192 is the racking task
[16:47:49] I can't see procurement, not sure if that's useful
[16:50:00] Yep, that's great. Many thanks. They're all out of warranty I think so we're scrabbling around for spares at the moment. I wonder how many more from that set of 23 are going to fall over in the next month or so.
[16:51:28] That's my worry
[16:51:44] Might be worth having DC-Ops look
[16:52:15] See if they're prone to failure
[16:52:26] Or how many are in the fleet
[16:52:46] Well, FYI when these alerts go off it's only a reduction in the Hadoop cluster performance, so we can tolerate them.
[16:53:38] Nothing too bad seems to have happened
[16:54:05] But we also don't want to start seeing it in the rest of the fleet, in case it's not only that set that's affected and it's a wider issue
[16:57:32] Yeah, sadly these batteries are consumables and they are built to expire around the end of a server's warranty period. It looks like we might have a batch that is particularly susceptible to early-ish failure, but it'll be a lot of work to replace them. I'll keep an eye out for any more. Thanks again.
[16:58:01] btullis: no problem
[17:34:35] Hi mforns - I just saw your ping on ops - sorry for that
[17:35:10] I'll be disconnecting soon but will happily spend some time with you before then if you're available
[17:36:18] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:47:49] heya joal :] do you have time now?
[17:48:07] Yes mforns :)
[17:48:10] cave?
[17:48:13] :] yes!
[19:07:23] (CR) Mforns: [V: +2] "OK, this has been tested with all the existing deletion jobs, and the companion change is ready." [analytics/refinery] - https://gerrit.wikimedia.org/r/694547 (https://phabricator.wikimedia.org/T270433) (owner: Mforns)
[19:16:12] Data-Engineering-Kanban, Event-Platform, Wikidata, Wikidata-Campsite, and 4 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (mforns)
[19:16:44] Data-Engineering-Kanban, Event-Platform, Wikidata, Wikidata-Campsite, and 4 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (mforns)