[00:59:41] Data-Engineering-Kanban, Data-Catalog, Data Engineering Planning (Sprint 01), Patch-For-Review: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (Milimetric) I tried in vain to get this working. Our environment is very painful, I'll explain a bit below. * I have a pat...
[05:53:11] Starting build #67 for job analytics-refinery-update-jars-docker
[05:53:41] (PS1) Maven-release-user: Add refinery-source jars for v0.2.3 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/813513
[05:53:41] Yippee, build fixed!
[05:53:42] Project analytics-refinery-update-jars-docker build #67: FIXED in 30 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/67/
[05:59:56] (CR) Aqu: [V: +2 C: +2] "This Patchset was autogenerated by https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/" [analytics/refinery] - https://gerrit.wikimedia.org/r/813513 (owner: Maven-release-user)
[06:16:31] !log analytics/refinery deployment
[06:16:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:42:33] Hi btullis, I have a space problem on an-launcher:
[06:42:33] an-launcher1002:~$ df -ha | grep srv
[06:42:33] /dev/mapper/vg0-srv 109G 109G 17M 100% /srv
[07:06:31] aqu: o/
[07:07:07] I see 78% free now, is the 100% reached with a single scap deployment?
[07:08:22] ah wow
[07:08:23] elukey@an-launcher1002:/srv/deployment/analytics/refinery-cache/revs$ sudo du -hs *
[07:08:26] 26G 2f5987d70647fd46dd58e564f674d4d76e81744d
[07:08:29] 7.8G bd39e672160695aaaa9d10a80fb812ac3fb1f6e2
[07:08:41] so the 26G one is from Jun 28th
[07:08:48] way bigger than the previous one
[07:08:56] I guess new deps etc. were added
[07:11:21] ahh ok, so the .git directory is ~13G, most of it in git fat land
[07:13:47] potential solution for the moment - /srv/airflow-analytics/logs is around 49G
[07:14:04] it can easily be reduced a lot to let the deployments flow
[07:14:28] but medium to long term there may be an issue with an-launcher's disk space
[07:28:13] Thanks elukey, I hadn't heard about new deps. So I will cut into airflow-analytics/logs
[07:31:10] elukey: `git fat gc` should get rid of objects that are no longer referenced
[07:32:10] hashar: ah nice!
[07:32:20] I'll let Ben decide what's best :)
[07:32:22] for refinery-cache/revs, that is maintained by scap 3 if I am right
[07:32:29] it is, yes
[07:32:48] the idea is to deploy the code on all machines, then in a second step switch the symlink to one of those revs
[07:33:10] so you get to switch from one version to another rather quickly (i.e. without having to wait for code to be synced to all hosts, since it is already there)
[07:33:18] it is merely promoting the new software by doing a symlink change
[07:33:52] I don't know whether scap garbage collects old revisions from the cache, and there might be hardlinks/symlinks between the cached repos, so deleting one old revision might break another one
[07:34:29] pretty sure scap has a cleanup command for the caches. I am checking
[07:36:23] cache_revs = self.config.get("cache_revs", 5)
[07:36:49] yep, we keep 2 revs for refinery
[07:36:54] elukey: seems like scap keeps up to 5 revisions in the cache
[07:36:56] ah good
[07:36:57] ;)
[07:37:29] the main issue is that airflow + refinery are growing and the space on an-launcher1002 may not be enough :(
[08:21:19] (PS9) Joal: Update refine to use Iceberg for event_sanitize [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739)
[08:37:18] Morning all. Apologies for the delay in responding.
[08:49:23] I'm making a tar archive of all logs older than 30 days in /srv/airflow-analytics/logs/scheduler and then I will remove them. We might not need the archive, but I'd rather be on the safe side.
[08:57:00] Data-Engineering: Sanitize network_flows_internal dataset - https://phabricator.wikimedia.org/T312915 (JAllemandou)
[09:23:24] Hi btullis,
[09:23:25] Each time analytics/refinery is deployed, a revision dir is created.
[09:23:25] The revision dir includes the same files twice:
[09:23:25] * 13GB of artifacts in artifacts/org/wikimedia/analytics/refinery
[09:23:25] * 13GB of files (artifacts, I think) in .git/fat/objects/
[09:23:25] I haven't found a git fat way to remove half of it...
[09:23:26] The revision `bd39e672160695aaaa9d10a80fb812ac3fb1f6e2` is my deploy today. Its 7.8GB may not be representative.
[09:23:27] Tell me when you have cleared some space. Thx
[09:38:54] aqu: Will do. I'm intrigued by the `git fat gc` command that hashar mentioned above, but I thought that clearing some space from logs would be a good place to start, before we experiment with anything more.
[09:43:52] I am not sure `git fat gc` would change much though
[09:44:23] each revision ends up in its own refinery-cache/, which would be the directory holding the git fat objects
[09:44:46] and on the next deployment only the current 2 caches are kept; the oldest one gets deleted with (I assume) all its git fat objects
[09:45:09] so I am guessing the refinery-cache/ directory holding git fat objects only has the objects for the revision held in the dir
[09:45:44] that is assuming there is one `.git/fat/objects/` per rev-cache dir
[09:46:13] if it is somehow shared from the common base directory, then the objects would have piled up there over time
[09:47:32] what is possible is that the directory on the deployment server runs `git fat pull`, thus populating the objects on the deployment server in /srv/deployment/<...refinery...>/.git/fat/objects
[09:47:48] and those get rsynced to the target host, though I would expect scap to filter that out
[09:48:00] I think scap is supposed to `git fat pull` on each of the deployment targets
[09:48:08] but I could be wrong
[09:48:34] OK, I have removed the old logs, so we now have 49 GB (56%) available.
[09:48:38] https://www.irccloud.com/pastebin/RBHChk0r/
[09:48:53] you can file a task with the details you found, and people from the release engineering team who are more familiar with scap/git fat might be able to give insights (they are on US west coast though, so still sleeping)
[09:49:11] nice btullis !
[09:49:50] hashar, thanks for all of this. It's really useful. I think that we might need to get releng involved, as you suggest.
[09:51:04] an-launcher1002 is actually quite an old machine and we have decided to replace it with a VM this year, rather than replace the hardware. So making sure that we are using the space efficiently will be quite important, I think.
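A minimal sketch of the promote-by-symlink deploy model hashar describes above: each revision is synced into its own directory under refinery-cache/revs first, and the deploy finishes by repointing a symlink, so switching versions is nearly instant. The `current` symlink name is an assumption for illustration; scap's actual layout may differ.

```bash
# Sketch of the promote-by-symlink pattern (illustrative; the "current"
# symlink name is an assumption, not necessarily scap's real layout).
BASE=/srv/deployment/analytics/refinery
REV=2f5987d70647fd46dd58e564f674d4d76e81744d

# Step 1: the new revision's code is already rsynced into its own cache dir.
ls -d "${BASE}-cache/revs/${REV}"

# Step 2: promote it by repointing the symlink; -n stops ln from
# descending into the old target when it is a directory symlink.
ln -sfn "${BASE}-cache/revs/${REV}" "${BASE}-cache/current"

# Rolling back is just pointing the symlink at the previous rev again.
```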
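And a rough sketch of the disk-space triage from this morning, combining the checks and the log archiving described above. It assumes GNU find/tar, git-fat on PATH, and enough free space for the tarball; treat it as illustrative, not as the team's actual runbook.

```bash
# 1. Measure each cached revision (as elukey did above).
sudo du -hs /srv/deployment/analytics/refinery-cache/revs/*

# 2. Look for git-fat objects that are no longer referenced; per hashar,
#    `git fat gc` would remove them.
cd /srv/deployment/analytics/refinery
git fat status | head

# 3. Archive scheduler logs older than 30 days, then delete them, keeping
#    the tarball to be on the safe side (write it to a volume with space).
LOGS=/srv/airflow-analytics/logs/scheduler
find "$LOGS" -type f -mtime +30 -print0 \
  | tar --null --files-from=- -czf "/tmp/scheduler-logs-$(date +%F).tar.gz"
find "$LOGS" -type f -mtime +30 -delete
```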
[09:51:30] deploy1002$ du -hs /srv/deployment/analytics/refinery/.git/fat/objects
[09:51:30] 23G /srv/deployment/analytics/refinery/.git/fat/objects
[09:52:24] then again, I would expect scap to not rsync any of `.git/fat/*`, but I don't know how it behaves
[09:52:44] so you can file a task about the problem and add #scap to it + cc me ;)
[09:55:11] `git fat status` does report a bunch of orphan objects ;]
[10:07:07] qq though - was it expected that the difference in size between the last two refinery scap revs is so big?
[10:11:22] aqu mentioned:
[10:11:22] > The revision `bd39e672160695aaaa9d10a80fb812ac3fb1f6e2` is my deploy today. Its 7.8GB may not be representative.
[10:12:36] So this is much smaller than the previous deploy on Jun 28, right? Maybe it was only a partial deploy? Do we need to do another scap deploy now that some additional space has been freed?
[10:14:55] weird
[10:16:58] Thx btullis, I will retry my deploy then, ok?
[10:17:06] I think btullis is right - it must be a partial deploy - git fat stopped when the disk was full
[10:17:31] The next deploy should be run with force, to make sure all artifacts are re-downloaded
[10:17:43] We do have backups of `deploy1002:/srv/deployment/analytics/refinery/.git/fat/objects`, so if we think that this directory might have ballooned in size at some point, perhaps there is value in checking the old backups to see their sizes.
[10:23:49] aqu: +1 from me on a force deploy as well.
[10:35:37] It's ballooning by ~200MB per release of refinery/source, currently.
[10:45:47] a-team: As discussed yesterday, I would like to push out new hadoop packages to the cluster today, based on version 1.10.2
[10:47:01] This will need a restart/failover/failback of hive, a restart of oozie, a rolling restart of the masters, and a rolling restart of the datanodes. I think that's the order that I'd go for.
[10:48:17] Sorry, that should say version 2.10.2
[11:00:58] I'll wait until the next gobblin run has completed before doing anything, but I was wondering if I should temporarily disable all timers on an-launcher1002 while the restarts are happening.
[11:01:38] joal: aqu: What do you think? Are you happy for me to proceed with the upgrade once the next gobblin run is complete?
[11:20:37] All four gobblin timers fired and the services ran successfully.
[11:30:58] I see that there are still a number of airflow jobs running as well, so I'm going to hold off on pushing out any updated packages.
[11:37:22] Data-Engineering: Sanitize network_flows_internal dataset - https://phabricator.wikimedia.org/T312915 (ayounsi) +1 to keep historical data; note that we're only collecting WMF IPs, so there should be nothing to sanitize (no PII). It would be safer to double-check it if it's not too costly though.
[11:56:44] Hey btullis - sorry I was away for a bit
[11:58:23] No worries. It's inspiration week too, so we're supposed to be out walking in the woods or whatever anyway :-)
[12:00:29] * joal wonders about going walking in the woods :)
[12:02:23] (CR) Joal: "Reviewed the first 10 files - Some \-escapes are not unescaped, and there are trailing spaces in almost all files." [analytics/refinery] - https://gerrit.wikimedia.org/r/812095 (https://phabricator.wikimedia.org/T311507) (owner: NOkafor)
[12:26:39] Data-Engineering-Kanban, Data-Catalog, Data Engineering Planning (Sprint 01), Patch-For-Review: Integrate Superset with DataHub - https://phabricator.wikimedia.org/T306903 (Ottomata) > On stat1008, gradlew fails because it launches a java process and I couldn't find a way to hack it to pass the w...
[12:33:55] :q
[12:34:08] Oops, wrong window.
[12:34:16] ottomata, mforns: hello - I have a question for either of you :)
[13:08:18] ping again ottomata or mforns, in case :)
[13:08:30] heya joal !
[13:08:39] Hi mforns :)
[13:08:55] I have a question on RefineSanitize, if you have a minute?
[13:09:53] yes joal, gimme 2 mins
[13:10:08] Ah! mforns no need actually, I found what I was after :)
[13:10:20] Thanks anyhow :)
[13:13:07] oh joal ok, let me know if there's sth else!
[13:13:31] Thanks mforns - You solve my problems by me asking you about them - this is an incredible power :)
[13:13:42] xD exactly
[13:13:57] I'll have to start charging people :-)
[13:18:34] (PS11) Joal: Update refine to use Iceberg for event_sanitize [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739)
[14:02:37] joal: hello oh!
[14:19:05] !log Deployed refinery using scap, then deployed onto hdfs (prod + test)
[14:19:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:20:03] (PS12) Joal: Update refine to use Iceberg for event_sanitize [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739)
[16:20:28] (CR) Joal: Update refine to use Iceberg for event_sanitize (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739) (owner: Joal)
[16:21:04] ottomata: I'm interested in your review of that last patch - I ended up doing some stuff that was not really planned :)
[16:23:51] Data-Engineering-Kanban, Airflow, Data Engineering Planning (Sprint 01): Create cassandra loading HQL files from their oozie definition - https://phabricator.wikimedia.org/T311507 (JAllemandou) Gerrit patch: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/812095
[16:51:16] (PS4) Hashar: Schemas for Gerrit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615)
[16:51:56] (CR) Hashar: "I have regenerated them from draft 2019 to draft-7. I still haven't set the "type" on the "const" elements." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[16:52:42] (CR) CI reject: [V: -1] Schemas for Gerrit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[17:07:35] (PS5) Hashar: Schemas for Gerrit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615)
[17:08:42] (CR) CI reject: [V: -1] Schemas for Gerrit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[17:17:40] (CR) Hashar: "I have:" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[17:28:56] Analytics, Analytics-Wikistats, Data-Engineering: "Pages to date" not loading with "daily" metric - https://phabricator.wikimedia.org/T312717 (Nevmit) Hi @Milimetric, can you do this API just "article". not all page types.
[17:48:15] no luck generating examples for the json schema :-\
[17:48:39] but I got the const defined with `"type": "string"`, so it is good progress :]
[17:48:46] happy hacking &
[17:50:37] ottomata: do you know what this is: /wmf/data/raw/event/eqiad.mediawiki.recentchange/year=2015/month=12/day=17/hour=03
[17:50:49] It was imported a few minutes ago...
[17:51:11] I'm doing some deletion script tests, and the deletion script brought this up as data to delete
[18:00:58] (CR) Ottomata: Update refine to use Iceberg for event_sanitize (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739) (owner: Joal)
[18:01:13] 2015!!!
[18:02:03] mforns: the raw partitioning for that data is done by gobblin on the meta.dt field
[18:02:11] meta.dt in this case is set by mediawiki!
[18:02:18] which we assumed we could trust!
[18:03:16] and indeed there is one event that has "dt": "2015-12-17T03:36:20Z",
[18:03:20] mforns: i think it is fine to delete
[18:04:07] no idea why mediawiki would emit that now
[18:04:15] joal: commented, one nit and q but looks good to me!
[18:04:25] hashar: nice!
[18:04:39] hashar: what do you mean generating examples? It's just failing?
[18:05:05] Thanks ottomata - reading :)
[18:06:19] ottomata: do you want us to spend a minute on that find thing?
[18:07:52] joal: sure!
[18:08:01] let's batcave quickly
[18:08:05] slack huddle plz!
[18:08:06] invited
[18:17:53] ottomata: ok! Now, that presents a problem, because it invalidates the security feature we added to the deletion script...
[18:19:06] Are those old-timestamped events going to make it to the refined event database?
[18:26:25] (CR) Ottomata: Update refine to use Iceberg for event_sanitize (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739) (owner: Joal)
[18:28:58] ottomata: I imagine not, because Refine only refines targets in the last 24 hours or so.
[18:29:34] right.
[18:29:42] ideally these events would be rejected by Kafka
[18:30:32] https://phabricator.wikimedia.org/T282887
[18:31:24] mforns: that should really not happen
[18:31:44] 👍
[18:32:05] Data-Engineering, Discovery, Event-Platform, SRE, Platform Team Workboards (Clinic Duty Team): Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (Ottomata) Happened again today. There was a mediawiki.recentchange event with a 2015 timestamp.
[18:32:50] thanks ottomata, I will then leave the raw events without an allowed-interval check for now, and when that task is tackled, I can add it.
[18:33:38] (CR) Ottomata: "This change is ready for review." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/813342 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[18:36:18] (PS13) Joal: Update refine to use Iceberg for event_sanitize [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739)
[18:36:27] (CR) Joal: Update refine to use Iceberg for event_sanitize (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739) (owner: Joal)
[18:37:18] (CR) Joal: [C: +1] "This is ready from a review perspective :) I'll continue my test batches on Friday, so far so good" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739) (owner: Joal)
[18:53:14] mforns: that task might not be tackled for a long time.
[18:53:34] it kind of depends on https://phabricator.wikimedia.org/T267648
[18:53:38] which isn't really prioritized
[19:30:30] Analytics, Analytics-Wikistats, Data-Engineering: "Pages to date" not loading with "daily" metric - https://phabricator.wikimedia.org/T312717 (Milimetric) >>! In T312717#8075651, @Nevmit wrote: > Hi @Milimetric, can you do this API just "article". not all page types. Yes, @Nevmit, the page type para...
[20:06:58] Quarry, Documentation-Review-Board, Key docs update 2021-22: Quarry docs - https://phabricator.wikimedia.org/T307011 (TBurmeister)
[20:07:32] Quarry, Documentation-Review-Board, Key docs update 2021-22: Quarry docs - https://phabricator.wikimedia.org/T307011 (TBurmeister) a: KBach
[20:23:41] (CR) Hashar: "I agree inferring the type from the value is poor. The "const" is really an "enum" with a single value, thus if I had a property such as:" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/813342 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[20:29:43] (CR) Ottomata: (DO NOT SUBMIT) spark: support "const" in json schema (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/813342 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[21:34:05] Analytics, Analytics-Wikistats, Data-Engineering: "Pages to date" not loading with "daily" metric - https://phabricator.wikimedia.org/T312717 (Nevmit) Thank you @Milimetric. So, when will the real problem be solved?
[21:40:36] Data-Engineering, Product-Analytics: Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history - https://phabricator.wikimedia.org/T266374 (Isaac) thanks @nettrom_WMF ! seeing that this task is about reverted edits as opposed to reverting edits (what I calculated above)...
[21:54:22] Data-Engineering, MediaWiki-extensions-EventLogging, Technical-Debt: Migrate usage of Database::select to SelectQueryBuilder in EventLogging - https://phabricator.wikimedia.org/T312335 (Southparkfan) Open→Invalid. Impacted code is not used by this extension.
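Returning to the allowed-interval check mforns mentions at 18:32: a hypothetical sketch of such a guard, which would flag a stray 2015 partition as suspicious instead of offering it for deletion. This is not the real deletion script; the paths, the 90-day retention and the 2-year sanity floor are all assumptions for illustration.

```bash
# Hypothetical allowed-interval guard for raw event partition deletion.
# Assumptions: GNU date, year=/month=/day= partition layout as seen above.
RETENTION_DAYS=90
floor=$(date -d '2 years ago' +%s)                 # nothing this old should exist in raw
cutoff=$(date -d "${RETENTION_DAYS} days ago" +%s)

check_partition() {
    local dir="$1" y m d ts
    # Extract year/month/day from .../year=2015/month=12/day=17/hour=03
    y=$(sed -E 's|.*year=([0-9]+).*|\1|'  <<<"$dir")
    m=$(sed -E 's|.*month=([0-9]+).*|\1|' <<<"$dir")
    d=$(sed -E 's|.*day=([0-9]+).*|\1|'   <<<"$dir")
    ts=$(date -d "${y}-${m}-${d}" +%s) || return 1
    if (( ts < floor )); then
        echo "SUSPICIOUS, outside allowed interval - not deleting: $dir"
    elif (( ts < cutoff )); then
        echo "would delete: $dir"                  # e.g. hdfs dfs -rm -r "$dir"
    else
        echo "keeping (within retention): $dir"
    fi
}

check_partition "/wmf/data/raw/event/eqiad.mediawiki.recentchange/year=2015/month=12/day=17/hour=03"
```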
[22:36:06] (CR) Hashar: (DO NOT SUBMIT) spark: support "const" in json schema (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/813342 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[22:36:11] (Abandoned) Hashar: (DO NOT SUBMIT) spark: support "const" in json schema [analytics/refinery/source] - https://gerrit.wikimedia.org/r/813342 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[22:36:32] ottomata: thanks for the refinery review ;)
[22:37:40] the json schema spec does not require a type for enum/const, since validation is done against each of the values; I finally got that "type" is purely a hint, even though it is not required by the spec
[23:32:10] (PS6) Hashar: Schemas for Gerrit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615)
[23:39:04] (CR) Hashar: Schemas for Gerrit (3 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
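To round off the const/type thread: a small demonstration of hashar's closing point that under draft-7 a "const" behaves like a single-value "enum" and an instance validates with or without a "type" hint next to it. It assumes the python3 jsonschema package (and its command-line entry point) is installed; the schema fragment is made up for illustration, not one of the actual Gerrit schemas.

```bash
# Made-up draft-7 schema: "project" carries an explicit "type" hint next
# to its "const"; "stage" deliberately has "const" alone.
cat > /tmp/const-schema.json <<'EOF'
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "project": {
      "type": "string",
      "const": "mediawiki/core"
    },
    "stage": {
      "const": "publish"
    }
  }
}
EOF

# Both properties validate: "type" is only a hint alongside "const".
echo '{"project": "mediawiki/core", "stage": "publish"}' > /tmp/event.json
python3 -m jsonschema -i /tmp/event.json /tmp/const-schema.json && echo valid
```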