[01:49:20] 10Analytics, 10Analytics-Kanban: Improve Refine bad data handling - https://phabricator.wikimedia.org/T289003 (10Milimetric) That sounds right. I guess it starts from https://github.com/wikimedia/analytics-refinery/blob/fc4d873c597771aeb31d75a81a4b44ddc2233c48/oozie/virtualpageview/hourly/virtualpageview_hour...
[02:25:25] I have 5 tabs open right now. Freedom is so sweet. So sweet :P
[06:26:44] Good morning
[06:27:09] bonjour!
[06:27:38] How are you elukey ?
[06:31:27] good! I am still trying to figure out how the hackathon works :D
[06:31:29] and you?
[06:39:20] All good elukey :)
[06:39:46] I think we need to figure out the project we wish to work on - but to me this will be difficult without others around and some smalltalk :)
[06:41:42] yeah but for example I'd like to work on pontoon for the testing infra, and I'll probably do it alone :D
[06:42:01] for the GPU stuff there are people interested but not sure when etc..
[06:42:28] hm - I'm gonna wait to see which projects have some traction in terms of people, and will join one of them (with the direct purpose of not working alone :)
[06:43:20] sure but the most popular ones are not that interesting for me, and hackathon also means working on things that interest you :D
[06:43:32] true
[06:56:02] !log Kill-restart pageview-monthly_dump-coord to apply fix for SLA
[06:56:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:06:58] (03PS1) 10Joal: Grow mediwiki-history-reduced spark ressources [analytics/refinery] - 10https://gerrit.wikimedia.org/r/725669
[07:07:52] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging hotfix" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/725669 (owner: 10Joal)
[07:10:16] !log Deploy refinery for mediawiki-history-reduced hotfix
[07:10:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:14:45] (03PS1) 10Ladsgroup: Add script to get some data out of wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725416 (https://phabricator.wikimedia.org/T291276)
[07:14:53] (03CR) 10Ladsgroup: [C: 03+2] Add script to get some data out of wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725416 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[07:16:25] (03Merged) 10jenkins-bot: Add script to get some data out of wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725416 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[07:26:57] 10Analytics-Clusters, 10SRE, 10ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (10elukey) @BTullis @razzi can you sync with Chris to perform this maintenance during the next days?
[07:31:17] 10Analytics: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (10elukey)
[07:32:01] !log Deploy refinery to hdfs
[07:32:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:35:56] 10Analytics: Automate kerberos credential creation and management to ease the creation of testing infrastructure - https://phabricator.wikimedia.org/T292389 (10elukey)
[07:36:34] 10Analytics: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (10elukey) The first big problem to solve is the one outlined in T292389, namely automating the bootstrap of a Krb KDC stack.
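(Editor's sketch, not part of the channel log.) The T292389 discussion above is about automating Kerberos credential creation for a testing KDC. A minimal sketch of that kind of automation with MIT Kerberos' `kadmin.local` (the realm, hostname, service name, and keytab path below are hypothetical placeholders, and the `run`/`DRY_RUN` wrapper is illustrative, not WMF tooling):

```shell
#!/usr/bin/env bash
# Hedged sketch: create a service principal non-interactively and export
# its keytab with MIT Kerberos. Defaults to a dry run that only prints
# the kadmin.local commands; set DRY_RUN=0 on a real KDC host.
set -euo pipefail

REALM="EXAMPLE.ORG"                         # hypothetical test realm
HOST="${1:-an-test-worker1001.example.org}" # hypothetical test host
SERVICE="${2:-hdfs}"
KEYTAB="/tmp/keytabs/${SERVICE}-${HOST}.keytab"

run() {
  # DRY_RUN=1 (the default here) prints commands instead of executing them.
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi
}

mkdir -p "$(dirname "$KEYTAB")"
run kadmin.local -q "addprinc -randkey ${SERVICE}/${HOST}@${REALM}"
run kadmin.local -q "ktadd -k ${KEYTAB} ${SERVICE}/${HOST}@${REALM}"
```

`addprinc -randkey` and `ktadd` are standard MIT Kerberos kadmin subcommands; a real bootstrap would also have to create the KDC database (`kdb5_util create`) and distribute the keytabs, which is left out here.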
[07:43:22] !log Kill-restart mediawiki-history-reduced job after deploy (more resources)
[07:43:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:02:31] (03PS1) 10Ladsgroup: Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725693 (https://phabricator.wikimedia.org/T291276)
[08:04:17] joal: if you have time there is the DSE sync
[09:01:20] 10Analytics: Un-fork analytics/gobblin - https://phabricator.wikimedia.org/T292396 (10Gehel)
[09:38:57] elukey: I was in taichi :S
[09:39:51] (03CR) 10Michael Große: [C: 03+2] Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725693 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[09:40:08] joal: ahhh yes I forgot sorry!
[09:40:16] np elukey :)
[09:41:09] joal: there was an issue earlier on with a spark job run by gmodena, it caused some extra traffic in network links, ending up (partially) saturating some of them (and paging SRE)
[09:41:14] (03Merged) 10jenkins-bot: Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725693 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[09:41:28] Arf - currently talking with him - will investigate
[09:41:47] I think it may be due to data shuffling or similar, I am wondering if we could follow up together and see if there are some good standard recommendations to make
[09:41:56] (the job was sized correctly, it was just the network usage)
[09:42:09] hm
[09:42:25] on the SRE side, we'll try to add some QoS in the future (Arzhel/Cathal will open tasks)
[09:50:37] elukey: we're reviewing the job, and it feels weird that this one paged :(
[09:53:49] ^ clarakosi: FYI
[09:59:33] elukey: my assumption is that the cluster was busy, and that gmodena's job just made the thing go over threshold while not being that big - It'll be good if we can get some more feedback on how often this happens
[10:16:19] joal: very rarely, the last time that happened it was due to misconfigured distributed tf jobs
[10:16:39] right - this is weird elukey :(
[10:19:24] elukey: we're re-running the job and monitoring, to see if it happens again
[10:30:01] joal: on an-worker1113 (that was pushing a lot of data IIRC) I recall seeing
[10:30:04] https://yarn.wikimedia.org/jobhistory/job/job_1632476005296_66307
[10:30:08] https://yarn.wikimedia.org/cluster/app/application_1632476005296_66309
[10:30:21] they were pretty big jobs in Yarn at the time, when I checked
[10:30:43] exactly elukey - My assumption is that the problem was due to the mediawiki-history-reduced job
[10:31:33] elukey: I had restarted the job after a failure, and there were some other user queries running at the same time
[10:33:11] okok perfect, it makes sense then, but it is not a relief :(
[10:33:22] yeah, I agree elukey :(
[10:33:39] elukey: maybe 10G in a shared network is not appropriate :S
[10:43:52] joal: there are some network-level follow-ups to do with Arzhel and Cathal, in the future we may end up in a better situation, but it will require some time
[10:44:19] ack elukey - I'm interested in being kept in the loop, just to follow up :)
[10:45:10] joal: of course! Mostly, at a high level, bigger inter-network links (100G) + QoS (still very ignorant about it, but it seems a good alternative, according to our netengs, to relocating hadoop hosts in separate racks)
[10:45:21] (lunch bbl)
[10:45:48] ack elukey - enjoy food - later :)
[12:25:43] (03PS1) 10Ladsgroup: Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725425 (https://phabricator.wikimedia.org/T291276)
[12:25:50] (03CR) 10Ladsgroup: [C: 03+2] Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725425 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[12:26:42] (03Merged) 10jenkins-bot: Fix metrics building in wb_changes [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/725425 (https://phabricator.wikimedia.org/T291276) (owner: 10Ladsgroup)
[12:32:17] 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering, 10SRE, 10ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10Ottomata) Thank you!!!
[13:24:46] (03PS1) 10Gerrit maintenance bot: Add ami.wikipedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/725883 (https://phabricator.wikimedia.org/T292421)
[13:25:25] folks is it ok if I failover analytics-hive to an-coord1002? (via DNS)
[13:25:32] to finish the openjdk restarts
[13:36:50] 10Analytics-Radar, 10Patch-For-Review: Update ROCm version on GPU instances. - https://phabricator.wikimedia.org/T287267 (10elukey) Things to review: 3.8 -> 4.3.1 https://rocmdocs.amd.com/en/latest/Current_Release_Notes/ROCm-Version-History.html https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Curren...
[13:37:06] 10Analytics, 10Analytics-Kanban: Improve Refine bad data handling - https://phabricator.wikimedia.org/T289003 (10Ottomata) > and passes that through to Refine as nonsense, resulting in the is_wmf_domain: false, probably. Hm, no, so something in `get_pageview_info` is getting that, not Refine. Ok, then next is...
[13:39:21] elukey: no problem for me!
[13:43:09] joal: done! will wait a bit and then restart daemons on an-coord1001
[13:43:23] !log failover analytics-hive to an-coord1002 (to restart java daemons on 1001)
[13:43:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:45:42] Thanks ottomata for the rerun of the job - I was about to ask
[13:46:00] sho thang! :)
[13:46:20] ottomata: How do we know which of the refine jobs we should run?
[13:46:35] the subject line of the email
[13:46:57] makes sense ottomata :)
[13:46:57] the spark job is not aware of the name of the wrapper script, so it was hard to make a full CLI to paste
[13:47:05] that would be better
[13:47:09] thanks for that I'll try to remember
[13:47:24] you probably have told me already but eh...
[13:47:36] yeah i only remember because i wrote it, if i didn't...
[13:48:36] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Refine#Administration links to https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine.pp which has docs about the different jobs
[13:49:38] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for later deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/725883 (https://phabricator.wikimedia.org/T292421) (owner: 10Gerrit maintenance bot)
[15:27:10] a-team: helloooo - asking permission to do the following: 1) to avoid disrupting too many people on stat100[5,8] without some heads-up time, I'd like to upgrade the ROCm GPU drivers on one of the hadoop worker nodes with a GPU tomorrow morning (it requires at least one reboot) 2) if nothing goes horribly wrong, upgrade stat1005 on Wednesday (sending an email to analytics-announce@ today)
[15:27:37] Miriam and Aiko will work with me to ensure that the new drivers work, then we'll try to experiment with tf
[15:27:46] (anybody is welcome anytime of course)
[15:28:12] long term - what if we bought a standard misc node with a GPU, calling it stat-test1001 or similar?
[15:28:15] elukey: Would it be worth decommissioning one node with a GPU for you to test?
[15:28:36] joal: nah I think it is fine to just stop yarn + hdfs as we do usually
[15:29:03] all good on my side elukey - if you encounter issues, please feel free to grab a host and play :)
[15:29:12] elukey: all monthly jobs are past now ;)
[15:29:19] I can keep them stopped via systemctl mask, so that reboots can happen freely
[15:29:25] ack thanks :)
[15:30:44] 10Analytics-Radar, 10Event-Platform, 10WMF-JobQueue, 10Wikibase change dispatching scripts to jobs, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (10Ladsgroup) We ran into this issue again today, we introduced a new job, let's call it A, that queues jobs in other w...
[15:36:24] 10Analytics-Radar, 10Event-Platform, 10WMF-JobQueue, 10Wikibase change dispatching scripts to jobs, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (10Ladsgroup) Job A and Job B: - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20promet...
[15:58:17] +1 go for it elukey!
[16:01:41] ack thanks!
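(Editor's sketch, not part of the channel log.) elukey's plan above, "stop yarn + hdfs ... keep them stopped via systemctl mask, so that reboots can happen freely", can be sketched as a small drain/restore script. The unit names (`hadoop-yarn-nodemanager`, `hadoop-hdfs-datanode`) are the Debian Hadoop package defaults and may differ on WMF hosts, and the `run`/`DRY_RUN` wrapper is illustrative:

```shell
#!/usr/bin/env bash
# Hedged sketch of draining a Hadoop worker for a driver upgrade + reboot.
# systemctl mask keeps a unit from being started (including at boot) until
# it is unmasked. Defaults to a dry run; set DRY_RUN=0 on a real host.
set -euo pipefail

UNITS=(hadoop-yarn-nodemanager hadoop-hdfs-datanode)  # assumed unit names

run() {
  # DRY_RUN=1 (the default here) prints commands instead of executing them.
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi
}

drain() {
  for u in "${UNITS[@]}"; do
    run systemctl stop "$u"
    run systemctl mask "$u"   # stays down across reboots
  done
}

restore() {
  for u in "${UNITS[@]}"; do
    run systemctl unmask "$u"
    run systemctl start "$u"
  done
}

drain
# ... upgrade ROCm packages and reboot (possibly more than once) here ...
restore
```

Masking rather than just stopping is what makes multiple reboots safe: a stopped-but-enabled unit would come back on the next boot.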
[16:16:19] Spark2-3
[16:16:23] woops
[16:31:10] (03CR) 10MewOphaswongse: [C: 03+2] Add structured_task/article/image_suggestion_interaction/1.0.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/725383 (owner: 10Gergő Tisza)
[16:31:48] (03Merged) 10jenkins-bot: Add structured_task/article/image_suggestion_interaction/1.0.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/725383 (owner: 10Gergő Tisza)
[16:38:42] (03PS2) 10Nettrom: Update documentation for anonymous_user_token [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/725351 (https://phabricator.wikimedia.org/T292209)
[16:53:16] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:56:32] !log restart java daemons on an-coord1001 (standby)
[16:56:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:58:10] all up and running :)
[16:58:17] will do the failover 1002 -> 1001 tomorrow
[17:04:22] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:19:20] 10Quarry: Query that worked previously is now reproducibly "stopped" - https://phabricator.wikimedia.org/T292470 (10Cirdan)
[20:11:38] 10Analytics, 10Product-Analytics, 10wmfdata-python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (10nettrom_WMF)
[22:13:46] 10Analytics-Radar, 10SRE, 10Traffic-Icebox, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Dzahn) Currently working on T288844 and added puppet code that allowed us to use a second, new, license for MaxMind geoip databases. So far e...