[01:49:20] 10Analytics, 10Analytics-Kanban: Improve Refine bad data handling - https://phabricator.wikimedia.org/T289003 (10Milimetric) That sounds right. I guess it starts from https://github.com/wikimedia/analytics-refinery/blob/fc4d873c597771aeb31d75a81a4b44ddc2233c48/oozie/virtualpageview/hourly/virtualpageview_hour...
[02:25:25] I have 5 tabs open right now. Freedom is so sweet. So sweet :P
[06:26:44] Good morning
[06:27:09] bonjour!
[06:27:38] How are you elukey ?
[06:31:27] good! I am still trying to figure out how the hackathon works :D
[06:31:29] and you?
[06:39:20] All good elukey :)
[06:39:46] I think we need to figure out the project we wish to work on - but to me this will be difficult without others around and some smalltalk :)
[06:41:42] yeah but for example I'd like to work on pontoon for the testing infra, and I'll probably do it alone :D
[06:42:01] for the GPU stuff there are people interested but not sure when etc..
[06:42:28] hm - I'm gonna wait to see which projects have some traction in terms of people, and will join one of them (with the direct purpose of not working alone :)
[06:43:20] sure but the most popular ones are not that interesting for me, and hackathon also means working on things that interest you :D
[06:43:32] true
[06:56:02] !log Kill-restart pageview-monthly_dump-coord to apply fix for SLA
[06:56:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:06:58] (03PS1) 10Joal: Grow mediwiki-history-reduced spark ressources [analytics/refinery] - 10https://gerrit.wikimedia.org/r/725669
[07:07:52] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging hotfix" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/725669 (owner: 10Joal)
[07:10:16] !log Deploy refinery for mediawiki-history-reduced hotfix
[07:10:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:14:45] (03PS1) 10Ladsgroup: Add script to get some data out of wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725416 (https://phabricator.wikimedia.org/T291276)
[07:14:53] (03CR) 10Ladsgroup: [C: 03+2] Add script to get some data out of wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725416 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[07:16:25] (03Merged) 10jenkins-bot: Add script to get some data out of wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725416 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[07:26:57] 10Analytics-Clusters, 10SRE, 10ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (10elukey) @BTullis @razzi can you sync with Chris to perform this maintenance during the next days?
[07:31:17] 10Analytics: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (10elukey)
[07:32:01] !log Deploy refinery to hdfs
[07:32:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:35:56] 10Analytics: Automate kerberos credential creation and management to ease the creation of testing infrastructure - https://phabricator.wikimedia.org/T292389 (10elukey)
[07:36:34] 10Analytics: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (10elukey) The first big problem to solve is the one outlined in T292389, namely automating the bootstrap of a Krb KDC stack.
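(Editor's sketch, not part of the channel log.) The T292389 discussion above is about automating Kerberos credential creation for a testing KDC. A minimal sketch of that kind of automation with MIT Kerberos' `kadmin.local` (the realm, hostname, service name, and keytab path below are hypothetical placeholders, and the `run`/`DRY_RUN` wrapper is illustrative, not WMF tooling):

```shell
#!/usr/bin/env bash
# Hedged sketch: create a service principal non-interactively and export
# its keytab with MIT Kerberos. Defaults to a dry run that only prints
# the kadmin.local commands; set DRY_RUN=0 on a real KDC host.
set -euo pipefail

REALM="EXAMPLE.ORG"                         # hypothetical test realm
HOST="${1:-an-test-worker1001.example.org}" # hypothetical test host
SERVICE="${2:-hdfs}"
KEYTAB="/tmp/keytabs/${SERVICE}-${HOST}.keytab"

run() {
  # DRY_RUN=1 (the default here) prints commands instead of executing them.
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi
}

mkdir -p "$(dirname "$KEYTAB")"
run kadmin.local -q "addprinc -randkey ${SERVICE}/${HOST}@${REALM}"
run kadmin.local -q "ktadd -k ${KEYTAB} ${SERVICE}/${HOST}@${REALM}"
```

`addprinc -randkey` and `ktadd` are standard MIT Kerberos kadmin subcommands; a real bootstrap would also have to create the KDC database (`kdb5_util create`) and distribute the keytabs, which is left out here.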
[07:43:22] !log Kill-restart mediawiki-history-reduced job after deploy (more resources)
[07:43:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:02:31] (03PS1) 10Ladsgroup: Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725693 (https://phabricator.wikimedia.org/T291276)
[08:04:17] joal: if you have time there is the DSE sync
[09:01:20] 10Analytics: Un-fork analytics/gobblin - https://phabricator.wikimedia.org/T292396 (10Gehel)
[09:38:57] elukey: I was in taichi :S
[09:39:51] (03CR) 10Michael Große: [C: 03+2] Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725693 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[09:40:08] joal: ahhh yes I forgot sorry!
[09:40:16] np elukey :)
[09:41:09] joal: there was an issue earlier on with a spark job run by gmodena, it caused some extra traffic in network links, ending up (partially) saturating some of them (and paging SRE)
[09:41:14] (03Merged) 10jenkins-bot: Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725693 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[09:41:28] Arf - currently talking with him - will investigate
[09:41:47] I think it may be due to data shuffling or similar, I am wondering if we could follow up together and see if there are some good standard recommendations to make
[09:41:56] (the job was sized correctly, it was just the network usage)
[09:42:09] hm
[09:42:25] on the SRE side, we'll try to add some QoS in the future (Arzhel/Cathal will open tasks)
[09:50:37] elukey: we're reviewing the job, and it feels weird that this one paged :(
[09:53:49] ^ clarakosi: FYI
[09:59:33] elukey: my assumption is that the cluster was busy, and that gmodena's job just made the thing go over threshold while not being that big - It'll be good if we can get some more feedback on how often this happens
[10:16:19] joal: very rarely, the last time that happened it was due to misconfigured distributed tf jobs
[10:16:39] right - this is weird elukey :(
[10:19:24] elukey: we're re-running the job and monitoring, to see if it happens again
[10:30:01] joal: on an-worker1113 (that was pushing a lot of data IIRC) I recall seeing
[10:30:04] https://yarn.wikimedia.org/jobhistory/job/job_1632476005296_66307
[10:30:08] https://yarn.wikimedia.org/cluster/app/application_1632476005296_66309
[10:30:21] they were pretty big jobs in Yarn at the time, when I checked
[10:30:43] exactly elukey - My assumption is that the problem was due to the mediawiki-history-reduced job
[10:31:33] elukey: I had restarted the job after a failure, and there were some other user queries running at the same time
[10:33:11] okok perfect, it makes sense then, but it is not a relief :(
[10:33:22] yeah, I agree elukey :(
[10:33:39] elukey: maybe 10G in a shared network is not appropriate :S
[10:43:52] joal: there are some network-level follow-ups to do with Arzhel and Cathal, in the future we may end up in a better situation, but it will require some time
[10:44:19] ack elukey - I'm interested in being kept in the loop, just to follow up :)
[10:45:10] joal: of course! Mostly, at a high level, bigger inter-network links (100G) + QoS (still very ignorant about it, but it seems a good alternative, according to our netengs, to relocating hadoop hosts in separate racks)
[10:45:21] (lunch bbl)
[10:45:48] ack elukey - enjoy food - later :)
[12:25:43] (03PS1) 10Ladsgroup: Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725425 (https://phabricator.wikimedia.org/T291276)
[12:25:50] (03CR) 10Ladsgroup: [C: 03+2] Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725425 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[12:26:42] (03Merged) 10jenkins-bot: Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725425 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[12:32:17] 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering, 10SRE, 10ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10Ottomata) Thank you!!!
[13:24:46] (03PS1) 10Gerrit maintenance bot: Add ami.wikipedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/725883 (https://phabricator.wikimedia.org/T292421)
[13:25:25] folks is it ok if I failover analytics-hive to an-coord1002? (via DNS)
[13:25:32] to finish the openjdk restarts
[13:36:50] 10Analytics-Radar, 10Patch-For-Review: Update ROCm version on GPU instances. - https://phabricator.wikimedia.org/T287267 (10elukey) Things to review: 3.8 -> 4.3.1 https://rocmdocs.amd.com/en/latest/Current_Release_Notes/ROCm-Version-History.html https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Curren...
[13:37:06] 10Analytics, 10Analytics-Kanban: Improve Refine bad data handling - https://phabricator.wikimedia.org/T289003 (10Ottomata) > and passes that through to Refine as nonsense, resulting in the is_wmf_domain: false, probably. Hm, no, so something in `get_pageview_info` is getting that, not Refine. Ok, then next is...
[13:39:21] elukey: no problem for me!
[13:43:09] joal: done! will wait a bit and then restart daemons on an-coord1001
[13:43:23] !log failover analytics-hive to an-coord1002 (to restart java daemons on 1001)
[13:43:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:45:42] Thanks ottomata for the rerun of the job - I was about to ask
[13:46:00] sho thang! :)
[13:46:20] ottomata: How do we know which of the refine jobs we should run?
[13:46:35] the subject line of the email
[13:46:57] makes sense ottomata :)
[13:46:57] the spark job is not aware of the name of the wrapper script, so it was hard to make a full CLI to paste
[13:47:05] that would be better
[13:47:09] thanks for that I'll try to remember
[13:47:24] you probably have told me already but eh...
[13:47:36] yeah i only remember because i wrote it, if i didn't...
[13:48:36] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Refine#Administration links to https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine.pp which has docs about the different jobs
[13:49:38] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for later deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/725883 (https://phabricator.wikimedia.org/T292421) (owner: 10Gerrit maintenance bot)
[15:27:10] a-team: helloooo - asking permission to do the following: 1) to avoid disrupting too many people on stat100[5,8] without some heads-up time, I'd like to upgrade the ROCm GPU drivers on one of the hadoop worker nodes with a GPU tomorrow morning (it requires at least one reboot) 2) if nothing goes horribly wrong, upgrade stat1005 on Wednesday (sending an email to analytics-announce@ today)
[15:27:37] Miriam and Aiko will work with me to ensure that the new drivers work, then we'll try to experiment with tf
[15:27:46] (anybody is welcome anytime of course)
[15:28:12] long term - what if we bought a standard misc node with a GPU, calling it stat-test1001 or similar?
[15:28:15] elukey: Would it be worth decommissioning one node with a GPU for you to test?
[15:28:36] joal: nah I think it is fine to just stop yarn + hdfs as we do usually
[15:29:03] all good on my side elukey - if you encounter issues, please feel free to grab a host and play :)
[15:29:12] elukey: all monthly jobs are past now ;)
[15:29:19] I can keep them stopped via systemctl mask, so that reboots can happen freely
[15:29:25] ack thanks :)
[15:30:44] 10Analytics-Radar, 10Event-Platform, 10WMF-JobQueue, 10Wikibase change dispatching scripts to jobs, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (10Ladsgroup) We ran into this issue again today, we introduced a new job, let's call it A, that queues jobs in other w...
[15:36:24] 10Analytics-Radar, 10Event-Platform, 10WMF-JobQueue, 10Wikibase change dispatching scripts to jobs, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (10Ladsgroup) Job A and Job B: - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20promet...
[15:58:17] +1 go for it elukey!
[16:01:41] ack thanks!
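(Editor's sketch, not part of the channel log.) elukey's plan above, "stop yarn + hdfs ... keep them stopped via systemctl mask, so that reboots can happen freely", can be sketched as a small drain/restore script. The unit names (`hadoop-yarn-nodemanager`, `hadoop-hdfs-datanode`) are the Debian Hadoop package defaults and may differ on WMF hosts, and the `run`/`DRY_RUN` wrapper is illustrative:

```shell
#!/usr/bin/env bash
# Hedged sketch of draining a Hadoop worker for a driver upgrade + reboot.
# systemctl mask keeps a unit from being started (including at boot) until
# it is unmasked. Defaults to a dry run; set DRY_RUN=0 on a real host.
set -euo pipefail

UNITS=(hadoop-yarn-nodemanager hadoop-hdfs-datanode)  # assumed unit names

run() {
  # DRY_RUN=1 (the default here) prints commands instead of executing them.
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi
}

drain() {
  for u in "${UNITS[@]}"; do
    run systemctl stop "$u"
    run systemctl mask "$u"   # stays down across reboots
  done
}

restore() {
  for u in "${UNITS[@]}"; do
    run systemctl unmask "$u"
    run systemctl start "$u"
  done
}

drain
# ... upgrade ROCm packages and reboot (possibly more than once) here ...
restore
```

Masking rather than just stopping is what makes multiple reboots safe: a stopped-but-enabled unit would come back on the next boot.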
[16:16:19] Spark2-3
[16:16:23] woops
[16:31:10] (03CR) 10MewOphaswongse: [C: 03+2] Add structured_task/article/image_suggestion_interaction/1.0.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/725383 (owner: 10Gergő Tisza)
[16:31:48] (03Merged) 10jenkins-bot: Add structured_task/article/image_suggestion_interaction/1.0.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/725383 (owner: 10Gergő Tisza)
[16:38:42] (03PS2) 10Nettrom: Update documentation for anonymous_user_token [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/725351 (https://phabricator.wikimedia.org/T292209)
[16:53:16] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:56:32] !log restart java daemons on an-coord1001 (standby)
[16:56:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:58:10] all up and running :)
[16:58:17] will do the failover 1002 -> 1001 tomorrow
[17:04:22] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:19:20] 10Quarry: Query that worked previously is now reproducibly "stopped" - https://phabricator.wikimedia.org/T292470 (10Cirdan)
[20:11:38] 10Analytics, 10Product-Analytics, 10wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (10nettrom_WMF)
[22:13:46] 10Analytics-Radar, 10SRE, 10Traffic-Icebox, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Dzahn) Currently working on T288844 and added puppet code that allowed us to use a second, new, license for MaxMind geoip databases. So far e...