[08:17:03] (CR) Lucas Werkmeister (WMDE): [C: +2] "<3" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803618 (owner: Hoo man)
[08:17:50] (Merged) jenkins-bot: Update composer "require-dev" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803618 (owner: Hoo man)
[08:18:14] (PS1) Lucas Werkmeister (WMDE): Update composer "require-dev" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803542
[08:18:42] (CR) Lucas Werkmeister (WMDE): [C: +2] Update composer "require-dev" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803542 (owner: Lucas Werkmeister (WMDE))
[08:18:54] (PS2) Lucas Werkmeister (WMDE): Sum site_stats rows [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803458 (https://phabricator.wikimedia.org/T310043)
[08:19:32] (Merged) jenkins-bot: Update composer "require-dev" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803542 (owner: Lucas Werkmeister (WMDE))
[08:50:44] (CR) Lucas Werkmeister (WMDE): [C: +2] "reapplying +2" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803458 (https://phabricator.wikimedia.org/T310043) (owner: Lucas Werkmeister (WMDE))
[08:51:23] (Merged) jenkins-bot: Sum site_stats rows [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803458 (https://phabricator.wikimedia.org/T310043) (owner: Lucas Werkmeister (WMDE))
[08:51:57] (PS1) Lucas Werkmeister (WMDE): Sum site_stats rows [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803543 (https://phabricator.wikimedia.org/T310043)
[08:52:03] (CR) Lucas Werkmeister (WMDE): [C: +2] Sum site_stats rows [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803543 (https://phabricator.wikimedia.org/T310043) (owner: Lucas Werkmeister (WMDE))
[08:52:40] (Merged) jenkins-bot: Sum site_stats rows [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803543 (https://phabricator.wikimedia.org/T310043) (owner: Lucas Werkmeister (WMDE))
[09:27:54] I will shortly start the analytics weekly deployment train. I have both refinery-source and refinery to deploy. Anyone have anything to add before I start?
[09:32:57] Starting build #106 for job analytics-refinery-maven-release-docker
[09:46:04] Project analytics-refinery-maven-release-docker build #106: SUCCESS in 13 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/106/
[10:08:24] Starting build #64 for job analytics-refinery-update-jars-docker
[10:08:55] (PS1) Maven-release-user: Add refinery-source jars for v0.2.1 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/803881
[10:08:56] Project analytics-refinery-update-jars-docker build #64: SUCCESS in 31 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/64/
[10:10:55] I am a bit confused by some of the deployment steps here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery-source#How_to_deploy_with_Jenkins_(and_related_steps)
[10:11:57] ...so I could do with some help. Specifically my update-jars-docker job had a lot of 404s in the log: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/64/console
[10:12:51] Therefore I might have got the version number wrong from the first step. This is my first time doing a refinery-source deploy.
[10:34:14] This is the CR that was created in refinery: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/803881
[11:42:32] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:50:16] Hi btullis -
[11:50:32] Hi joal
[11:50:38] sorry, kid's day today - Has the release worked with jenkins finally?
[11:51:47] I'm not sure. I read all of the instructions but I am still confused about version numbers. This has lots of 404s in the logs: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/64/console
[11:53:48] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:54:58] From the end of the log, it seems to have succeeded - Will check on archiva
[11:55:02] I don't know if I was supposed to do another version bump of refinery-source - You already updated the changelog.md to version 0.2.0 but when I ran the analytics-refinery-maven-release-docker job it said that it was building 0.2.1 and I'm not sure why.
[11:55:27] Thanks ever so much.
[11:56:02] btullis: You should have changed the changelog to 0.2.1, as this was the version you were about to release
[11:57:17] btullis: the artifact has been released to archiva - it's all good on that front
[12:02:18] OK, right. I didn't get that I should bump the version for a release. I thought because you had already merged 0.2.0 that was the version being released.
[12:03:01] This first paragraph is ultra-confusing to me: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery-source#How_to_deploy_with_Jenkins_(and_related_steps)
[12:04:23] (CR) Hoo man: [C: +2] Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/787905 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[12:04:28] btullis: the current version of the project (here: https://github.com/wikimedia/analytics-refinery-source/blob/master/pom.xml#L10) tells you which version will be released (the process is: SNAPSHOT -> RELEASE)
[12:04:52] So now we are at 0.2.2-SNAPSHOT, next release will be 0.2.2
[12:05:02] (Merged) jenkins-bot: Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/787905 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[12:06:02] (PS1) Hoo man: Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803545 (https://phabricator.wikimedia.org/T201491)
[12:06:21] When you deployed, the version in the pom was 0.2.1-SNAPSHOT, so you have released 0.2.1 - And the change in the changelog.md file would have been to match that
[12:06:33] I'm gonna provide that patch so that you see
[12:08:35] (CR) Hoo man: [C: +1] "Don't have +2 here" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803545 (https://phabricator.wikimedia.org/T201491) (owner: Hoo man)
[12:10:38] Data-Engineering, Data-Engineering-Kanban, Discovery, Generated Data Platform: Agree on and adopt WMF scalastyle conventions - https://phabricator.wikimedia.org/T310143 (Ottomata)
[12:10:45] (PS1) Joal: Update changelog.md for v0.2.1 after release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/803897
[12:11:02] btullis: --^
[12:15:07] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/802885 (owner: Joal)
[12:22:13] joal: many thanks. Will take a look shortly.
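Editor's note on the SNAPSHOT -> RELEASE rule described at 12:04-12:06: a minimal sketch of how the released version follows from the version in pom.xml, assuming the usual Maven convention that a release drops the `-SNAPSHOT` suffix and the next development version bumps the last numeric component. The function names `release_version` and `next_snapshot` are hypothetical and are not part of refinery-source or the Jenkins jobs.

```python
SNAPSHOT_SUFFIX = "-SNAPSHOT"


def release_version(pom_version: str) -> str:
    """Return the version a release would publish for a given pom version.

    Assumption: the pom carries a development version ending in -SNAPSHOT.
    """
    if not pom_version.endswith(SNAPSHOT_SUFFIX):
        raise ValueError(f"not a snapshot version: {pom_version!r}")
    return pom_version[: -len(SNAPSHOT_SUFFIX)]


def next_snapshot(released: str) -> str:
    """Return the next development version after a release.

    Assumption: only the last dot-separated component is incremented.
    """
    parts = released.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts) + SNAPSHOT_SUFFIX


if __name__ == "__main__":
    pom = "0.2.1-SNAPSHOT"            # version in the pom at deploy time
    released = release_version(pom)   # the version the changelog should name
    print(released, next_snapshot(released))
```

Under these assumptions, the changelog entry for a release should name the pom version with `-SNAPSHOT` removed, which is why the 0.2.0 changelog entry did not match the 0.2.1 build.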
[12:22:50] btullis: let's batcave if/when you wish to review the process
[12:31:09] (CR) Lucas Werkmeister (WMDE): [C: +2] "I’ll merge this to match master, but “separatly” is still not correctly spelled." [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803545 (https://phabricator.wikimedia.org/T201491) (owner: Hoo man)
[12:31:12] (CR) Lucas Werkmeister (WMDE): Fix typo (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/787905 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[12:31:49] (Merged) jenkins-bot: Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803545 (https://phabricator.wikimedia.org/T201491) (owner: Hoo man)
[12:32:33] (CR) Hoo man: [C: +1] Fix typo (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803545 (https://phabricator.wikimedia.org/T201491) (owner: Hoo man)
[12:44:48] (CR) Klein Muçi: "This change is ready for review." [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803906 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[13:19:31] joal, I'm back now if you'd like to batcave.
[13:19:45] let's do that
[13:27:47] (CR) Lucas Werkmeister (WMDE): [C: +2] Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803906 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[13:28:25] (Merged) jenkins-bot: Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803906 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[13:29:33] (PS1) Lucas Werkmeister (WMDE): Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803907 (https://phabricator.wikimedia.org/T201491)
[13:30:09] (CR) Lucas Werkmeister (WMDE): [C: +2] Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803907 (https://phabricator.wikimedia.org/T201491) (owner: Lucas Werkmeister (WMDE))
[13:30:48] (Merged) jenkins-bot: Fix typo [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/803907 (https://phabricator.wikimedia.org/T201491) (owner: Lucas Werkmeister (WMDE))
[13:38:52] (CR) Btullis: [V: +2 C: +2] Add refinery-source jars for v0.2.1 to artifacts [analytics/refinery] - https://gerrit.wikimedia.org/r/803881 (owner: Maven-release-user)
[13:45:40] !log deploying refinery
[13:45:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:52:29] I do like a pretty graph: https://grafana.wikimedia.org/d/ZvSPbGOnz/hadoop-server-utilization-btullis?orgId=1&from=1654571216519&to=1654576349226&forceLogin&viewPanel=23
[13:58:19] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform: [Shared Event Platform] - Research Flink Changelog semantics to inform POC MW schema design - https://phabricator.wikimedia.org/T310082 (Ottomata) Just discussed a bit with @dcausse. I think we know enough now to make a decision on...
[14:20:38] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform: [Shared Event Platform] - Research Flink Changelog semantics to inform POC MW schema design - https://phabricator.wikimedia.org/T310082 (Ottomata) Additionally, we could adopt some kind of convention for designating a 'primary key' i...
[14:41:59] (CR) Klein Muçi: Fix typo (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/787905 (https://phabricator.wikimedia.org/T201491) (owner: Klein Muçi)
[14:48:29] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform: [Shared Event Platform] - Research Flink Changelog semantics to inform POC MW schema design - https://phabricator.wikimedia.org/T310082 (Ottomata) Q: does this field belong in `meta`? Probably yes, but if I could go back now I would...
[15:07:42] (CR) Joal: [V: +2 C: +2] "Merging post-release" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/803897 (owner: Joal)
[15:42:13] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:42:13] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:42:34] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:42:40] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:42:50] aouch
[15:43:06] ---^ This is me. It shouldn't be an issue, as it is the old AQS, but I'm looking now.
[15:43:14] thanks btullis
[15:43:38] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:43:46] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:44:16] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:44:16] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:52:56] RECOVERY - Check unit status of refinery-sqoop-whole-mediawiki on an-launcher1002 is OK: OK: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:08:21] Data-Engineering, Event-Platform: jsonschema-tools tests should fail if schema $id does not match title or path - https://phabricator.wikimedia.org/T300404 (phuedx)
[16:43:18] Data-Engineering, Data-Engineering-Kanban, Platform Engineering, Product-Analytics: AQS `edited-pages/new` metric does not make clear that the value is net of deletions - https://phabricator.wikimedia.org/T240860 (JArguello-WMF)
[16:55:40] Hey mforns - now that the train has passed, would you be ok with me merging and deploying some airflow code?
[17:28:25] heya sorry joal missed the ping
[17:28:34] np mforns :)
[17:28:39] yes, let's merge
[17:28:41] pinging again!
[17:28:43] :D
[17:29:11] Awesome - Let's do spark3 change first if ok for you?
[17:29:11] what should we merge?
[17:29:15] all spark3 changes?
[17:30:04] I was thinking only the spark3 update first
[17:30:06] mforns: -^
[17:30:13] to make sure stuff works
[17:30:14] ok
[17:30:21] But, we can also bundle
[17:30:39] Since deploys are cheap, I like doing 1 by 1 deploys :)
[17:30:40] this one right? https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/76
[17:30:46] correct
[17:30:52] it's a good idea to make separate deploys, since they are fast
[17:30:57] ok, merging
[17:30:58] I can do it if you wish - I just wanted your approval :)
[17:31:05] Thank you!
[17:31:33] arg 2fa
[17:32:15] merged
[17:33:02] joal: we can re-run an hourly task
[17:33:07] to test
[17:33:26] mforns: I'm gonna deploy and rerun an AQS hourly task indeed
[17:33:34] ok
[17:34:20] mforns: Actually, possibly we could also merge other ones - such as this one: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/67
[17:35:16] joal: merged
[17:35:16] And actually, I wonder if I'd not be ok to merge all my PRs at once, and deploy and test
[17:36:01] Ah! I need a change
[17:36:07] so no merging for the other ones :)
[17:37:03] mforns: we can go for https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/69
[17:37:11] and https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/70
[17:37:32] number 66 needs a change - doing it now
[17:38:02] merged 69
[17:38:38] also 70
[17:41:33] mforns: I just updated 66 with rebase + latest fix
[17:41:42] k
[17:42:36] waiting for pipeline
[17:55:17] joal: done
[18:00:23] ok mforns - I'm gonna deploy that
[18:00:29] k
[18:00:46] * joal is a bit sweaty with such a big deploy
[18:02:33] !log deploying Airflow dags
[18:02:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:05:05] mforns: facing a similar issue to the other day: some jobs don't show up in the UI anymore
[18:05:15] hm
[18:05:17] mforns: is it ok for you if I restart airflow?
[18:05:18] looking
[18:05:23] yes
[18:06:15] !log Restart airflow after deploy for dag reprocessing
[18:06:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:09:09] mforns: still issue :(
[18:09:16] right
[18:10:09] mforns: do you have a trick to force airflow to update itself?
[18:10:20] nope
[18:10:41] the other times it updated itself were after freeing resources on the machine
[18:11:18] joal: we can try again to stop airflow, check that there are no orphaned or zombie airflow processes, and then restart
[18:11:26] ack
[18:12:48] mforns: we have 16 dags in UI, 18 dag files parsed :(
[18:12:48] I'll do that
[18:12:57] ???
[18:13:03] yup
[18:13:42] I see only 8/18 DAGs parsed
[18:14:20] Ah! I think I was not reading the logs correctly
[18:14:22] batcave?
[18:15:10] yes!
[18:32:51] heya ottomata can I give you a puppet patch for an airflow fix to review?? :]
[18:37:36] joal: https://gerrit.wikimedia.org/r/c/operations/puppet/+/803973
[18:41:38] ottomata: if you have time :] https://gerrit.wikimedia.org/r/c/operations/puppet/+/803973
[18:42:02] This should prevent the DAG parsing issues we've been seeing.
[18:50:37] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform: Flink output support for Event Platform events - https://phabricator.wikimedia.org/T310218 (Ottomata)
[18:50:50] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform: Flink output support for Event Platform events - https://phabricator.wikimedia.org/T310218 (Ottomata) a:gmodena→Ottomata
[18:54:08] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform: Flink output support for Event Platform events - https://phabricator.wikimedia.org/T310218 (Ottomata) I had originally leaned toward option 2. custom serialization format, but the more I think about it, this is not about serializati...
[19:02:09] Actually mforns, I think Antoine's patch for Clickstream has been released!
[19:02:19] mforns: I'm gonna retry the task - ok?
[19:02:24] ah! ok!
[19:03:08] Done mforns
[19:03:27] cool
[19:03:33] will keep an eye on that
[19:04:46] Thanks a lot mforns <3
[19:04:52] Gone for today folks
[19:04:57] byeeeee :]
[19:26:08] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform: Flink output support for Event Platform events - https://phabricator.wikimedia.org/T310218 (Ottomata) Doing it at pipeline step level will be a little hard though. We'd likely want to work at Table API level, so that both DataStream...
[21:51:54] Quarry: Prettify Quarry's "User not found" page - https://phabricator.wikimedia.org/T134661 (Abstract09) a:Abstract09
[23:21:50] Data-Engineering, Event-Platform, User-Elukey: Create EventStream's equivalent to irc.wikimedia.org's #central channel - https://phabricator.wikimedia.org/T240182 (JArguello-WMF)
[23:22:05] Data-Engineering, Cassandra, Pageviews-API, User-Elukey: Improve user management for AQS Cassandra - https://phabricator.wikimedia.org/T142073 (JArguello-WMF)