[05:24:24] <wikibugs>	 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Replica templatelinks table is broken for some sites - https://phabricator.wikimedia.org/T317258 (10Ladsgroup) I'll run maintain views on wikis it's broken.
[05:25:46] <wikibugs>	 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Replica templatelinks table is broken for some sites - https://phabricator.wikimedia.org/T317258 (10Marostegui) The view is broken and that's why you got `  SELECT * FROM templatelinks LIMIT 5; ERROR 1356 (HY000): View 'frwiki_p.te...
[05:26:19] <wikibugs>	 10Data-Engineering, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Replica templatelinks table is broken for some sites - https://phabricator.wikimedia.org/T317258 (10Marostegui) 05Open→03Resolved a:03Marostegui Should be fixed now
[06:54:52] <wikibugs>	 10Data-Engineering: Move archiva to private IPs + CDN - https://phabricator.wikimedia.org/T317182 (10elukey) >>! In T317182#8217435, @Ottomata wrote: > +1 >  >> if external resources (eg. git) needs to be fetched from the Internet > JVM Dependencies are automatically fetched from the internet (maven central usua...
[07:28:32] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream (Sprint 01), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10dcausse) @Ottomata just to clarify: the new page_kind `visibility_change` will only be emitted for the...
[10:10:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp3057 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3057%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:15:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp3057 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3057%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:43:28] <wikibugs>	 10Data-Engineering, 10Equity-Landscape: Population input metrics - https://phabricator.wikimedia.org/T309279 (10ntsako) Work on including data specified on Asana `ntsako.population_data` has bee replaced by `ntsako.population_input_metrics`
[11:25:41] <wikibugs>	 10Quarry, 10Documentation-Review-Board, 10Key docs update 2021-22: Quarry docs - https://phabricator.wikimedia.org/T307011 (10KBach) 05In progress→03Resolved
[13:02:26] <wikibugs>	 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Phabricator, 10Product-Analytics, 10wmfdata-python: Herald rule to add Product Analytics and Data Engineering tags to Wmfdata-Python tasks - https://phabricator.wikimedia.org/T304572 (10Aklapper)
[13:02:43] <wikibugs>	 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Phabricator, 10Product-Analytics, 10wmfdata-python: Herald rule to add Product Analytics and Data Engineering tags to Wmfdata-Python tasks - https://phabricator.wikimedia.org/T304572 (10Aklapper)
[13:35:47] <joal>	 ottomata: meeting?
[14:01:53] <ottomata>	 joal:  had a conflict, sorry!
[14:02:02] <ottomata>	 actually i will probably always have that conflict
[14:26:40] <wikibugs>	 10Data-Engineering: Move archiva to private IPs + CDN - https://phabricator.wikimedia.org/T317182 (10Ottomata) Oh hm. Yes.  Well maybe.  Scap uses git fat to pull artifacts via rsync from archiva.  This should still work internally.  Errrr, I'm not certain about how the refinery artifact update Jenkins job works...
[14:33:57] <wikibugs>	 10Data-Engineering: Move archiva to private IPs + CDN - https://phabricator.wikimedia.org/T317182 (10elukey) Ack! I was thinking also to when one tests stuff like refinery-source locally (on their laptop/workstation outside the WMF), IIRC it pulls directly from archiva.wikimedia.org? it should work via squid pro...
[14:39:28] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream (Sprint 01), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10Ottomata) >> comment as empty string > Will this change existing revision_create behaviors or is it jus...
[14:48:48] <wikibugs>	 10Data-Engineering: Move archiva to private IPs + CDN - https://phabricator.wikimedia.org/T317182 (10Ottomata) When testing analytics/refinery/source, it is all Archiva / maven API.    analytics/refinery is used for deployment in prod of .jar files pulled from archiva via git-fat rsync.  Sometimes, users might w...
[14:51:53] <wikibugs>	 10Data-Engineering: Move archiva to private IPs + CDN - https://phabricator.wikimedia.org/T317182 (10elukey) ack thanks for the clarification!
[15:31:00] <joal>	 ottomata: no problem for me, but folks from the okapi team are mostly interested in your work :)
[15:34:33] <wikibugs>	 10Quarry, 10GitLab (Project Migration): Move quarry to gitlab or github - https://phabricator.wikimedia.org/T308978 (10rook) As mentioned above github appears to serve the project requirements where gitlab is not quite ready. I don't believe I mentioned it here but one of the main problems is the lack of being...
[15:52:24] <joal>	 mforns: there is something weird with the airflow SLAs now :(
[15:59:41] <mforns>	 :CCCCCCCCC
[15:59:47] <mforns>	 joal: will look
[16:00:00] <joal>	 mforns: I'm in the standup meeting if you wish
[16:00:05] <mforns>	 ok joining
[16:10:37] <joal>	 milimetric: seeing that you check SLA alerts - many of them come from the test-cluister (the aqs_hourly ones) - false positive
[16:11:15] <milimetric>	 yes, the airflow-analytics_test ones I just ignored.  The one I responded to was the clickstream monthly job, that was stuck but got going on its own (or someone else did it, I don't know)
[16:12:28] <joal>	 I think it went on its own - I don't know why it was late though :(
[16:12:33] <joal>	 thanks for checking milimetric 
[16:14:01] <joal>	 milimetric: actually, the SLA for clickstream is 2 days and the job starts on the first of the month - the job can't run in 2 days, so possibly the message is valid
[16:16:15] <joal>	 there weird thing is receiving the email on the 7th for stuff supposedly having waited 2 days
[16:16:20] * joal doesn't understand :S
[16:18:09] <mforns>	 oh joal, so SLAs were from the test cluster
[16:24:57] <joal>	 indeed mforns :)
[16:25:10] <mforns>	 ok ok
[16:34:11] <joal>	 mforns: would you have aminute to show me the errors you see in the graphite-sending jobs?
[17:18:24] <mgerlach>	 Hi, I was working on a jupyter-notebook on stat1008 but now seems unresponsive with some process taking lots of CPU. In case I might have accidentally caused this, I dont know how to check and stop it (no response on jupyterhub). could someone have a look and help. Thank you.
[17:23:50] <joal>	 hey mgerlach - seems I can't connect to stat1008 - bad sign
[17:24:01] <joal>	 ottomata, btullis - could anyone of you help here please?
[17:26:46] <joal>	 wow this is unusual - grafana shows 90% CPU usage for system - first time I see this
[17:32:07] <joal>	 !log make ops reboot stat1008
[17:32:09] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:33:18] <dsaez_>	 same here
[17:34:54] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator, 10Product-Analytics, 10wmfdata-python: Herald rule to add Product Analytics and Data Engineering tags to Wmfdata-Python tasks - https://phabricator.wikimedia.org/T304572 (10Aklapper) 05Open→03Resolved a:03Aklapper Created H407: When all...
[17:35:25] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10MediaWiki-libs-HTTP, 10Beta-Cluster-reproducible, 10Wikimedia-production-error: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (10Krinkle)
[17:35:47] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator, 10Product-Analytics, 10wmfdata-python: Herald rule to add Product Analytics and Data Engineering tags to Wmfdata-Python tasks - https://phabricator.wikimedia.org/T304572 (10Aklapper) Note that this does not apply retroactively. Should I mass-a...
[17:35:56] <icinga-wm>	 PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100%
[17:39:41] <mgerlach>	 joal: thanks for taking a look. stat1008 is working for me again. 
[17:39:44] <icinga-wm>	 RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[17:40:04] <joal>	 mgerlach: it has rebooted, but has issues - could you please refrain from launching jobs for a while?
[17:40:10] <joal>	 we're talking about it in ops chan
[17:41:01] <mgerlach>	 joal: yes, no problem. 
[17:41:25] <joal>	 thanks mgerlach 
[17:45:45] <joal>	 mgerlach: you can use the host, but we have an issue with the xmldumps mount, so please don't use it :)
[17:46:19] <joal>	 ping dsaez_ as well - refrain from using the smldumps mount for now please
[17:46:46] <mgerlach>	 joal: understood. thanks again.
[17:47:30] <joal>	 andrewbogott: Heya - it seems we have an issue on stat1008 related to the change we applied on Tuesday :S
[17:48:55] <wikibugs>	 10Analytics-Clusters: Puppet failure on stat1008 - https://phabricator.wikimedia.org/T317359 (10ssingh)
[17:49:34] <btullis>	 joal: I am here now, can I still help somehow?
[17:49:39] <sukhe>	 btullis: see T317359 :)
[17:49:40] <stashbot>	 T317359: Puppet failure on stat1008 - https://phabricator.wikimedia.org/T317359
[17:49:48] <joal>	 thanks again sukhe :)
[17:49:52] <sukhe>	 np, hth 
[17:50:35] <btullis>	 Thank you sukhe.
[17:51:02] <joal>	 btullis: I think the problem is related to a change done last Tuesday with andrewbogott 
[17:51:44] <btullis>	 Got it, that was reverted at the time, wasn't it?
[17:52:52] <andrewbogott>	 I'm not sure it's related, but who is/was user 400?
[17:53:06] <andrewbogott>	 It wasn't fully reverted.  The new mounts were left in place but the preference for the new server was reverted.
[17:53:33] <btullis>	 OK, thanks Andrew.
[17:54:24] <btullis>	 https://www.irccloud.com/pastebin/vL3CUVap/
[17:55:55] <andrewbogott>	 I don't understand why it's chowning and also don't understand why it thinks the mountpoint is read-only (the mount is, but why the mount point?)
[17:57:50] <andrewbogott>	 oh, huh, apparently the ownership changes when it's mounted
[17:58:45] <btullis>	 Yeah, so it will take on the ownership of the root directory of the NFS source. Is there any chance that could have changed during the recent rsyncs? 
[17:59:42] <andrewbogott>	 I don't know -- I doubt it. I especially doubt that it would've changed on the old servers.
[18:01:52] <btullis>	 Why would it only be stat1008. Doing a puppet run on stat1004 to check it out.
[18:02:21] <btullis>	 OK, same failure. That makes a bit of sense anyway.
[18:05:42] <andrewbogott>	 I don't totally understand but on VMs we create the mountpoints with an exec
[18:05:44] <andrewbogott>	 https://www.irccloud.com/pastebin/Zb3faQdZ/
[18:07:21] <joal>	 Thanks a lot btullis for chiming in - and sorry for the hot potatoe :S
[18:07:51] <btullis>	 That's OK. I may have to dash before long though.
[18:08:20] <joal>	 btullis: np, at least the problem seems identified
[18:11:10] <btullis>	 Yeah, apart from puppet run failures, is there any other active issue? Stat1008 was unresponsive, was it?
[18:12:10] <sukhe>	 if I may chime in strictly as someone who rebooted the host and doesn't know much outside of it, it was unresponsive and we knew that because mgerlach said so above
[18:12:13] <sukhe>	 13:18:24 < mgerlach> Hi, I was working on a jupyter-notebook on stat1008 but now seems unresponsive with some process taking lots of CPU. In case I might have accidentally caused this, I dont know how to check and stop it (no response on jupyterhub). could someone have a look and help. Thank you.
[18:12:49] <sukhe>	 it's hard to say though if the notebook caused the issues or was just a symptom of the problem above
[18:13:12] <sukhe>	 though I will note that prior to rebooting the host via IPMI, I noticed on Icinga that it was complaining about the puppet issue above
[18:18:45] <andrewbogott>	 sukhe: you only rebooted 1008, or both?
[18:18:50] <sukhe>	 just 1008
[18:23:32] <andrewbogott>	 ok, so if I suppress my curiosity about 'why now' there are two straightforward fixes -- either do what cloud-vps does with the 'exec' or explicitly set mount-point ownership to 400/400.  btullis, have a preference?
[18:24:23] <andrewbogott>	 although to be honest I don't see why it's trying to set it to 0 in the first place unless that's default behavior when ownership isn't specified.
[18:45:12] <wikibugs>	 10Analytics-Clusters: Puppet failure on stat1008 - https://phabricator.wikimedia.org/T317359 (10Andrew) There are two easy fixes for this.  Either change the mountpoint definition to explicitly set ownership to 400/400 or create mountpoints the way they're created in cloud-vps:   `         # Via exec to graceful...
[19:07:27] <ottomata>	 yargh i'm am bad at IRC these days, orry sorry i missed your ping joal
[19:54:01] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator, 10Product-Analytics, 10wmfdata-python: Herald rule to add Product Analytics and Data Engineering tags to Wmfdata-Python tasks - https://phabricator.wikimedia.org/T304572 (10nshahquinn-wmf) Thank you @Aklapper!  >>! In T304572#8222816, @Aklappe...
[19:58:34] <wikibugs>	 10Analytics-Radar, 10Data-Engineering, 10Data-Engineering-Radar, 10Product-Analytics, 10wmfdata-python: Consider rewriting wmfdata-python to use omniduct - https://phabricator.wikimedia.org/T275038 (10nshahquinn-wmf) 05Open→03Declined >>! In T275038#7761684, @EChetty wrote: > Just to note -> After sp...
[20:17:16] <wikibugs>	 10Analytics-Radar, 10Data-Engineering, 10Data-Engineering-Radar, 10Product-Analytics, 10wmfdata-python: Consider rewriting wmfdata-python to use omniduct - https://phabricator.wikimedia.org/T275038 (10Ottomata) +1
[23:57:17] <wikibugs>	 10Data-Engineering, 10Data Pipelines: airflow instances should use specific artifact cache directories - https://phabricator.wikimedia.org/T315374 (10bmansurov) @xcollazo thanks for the ping. If you mean the research instance on `deploy1002`, then I've pulled your changes, rebased on main and deployed. If you...