[06:46:14] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Technical-Debt: Migrate usage of Database::select to SelectQueryBuilder in ORES - https://phabricator.wikimedia.org/T312454 (10Ladsgroup)
[08:51:53] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10elukey) Ack! The link for full service restart may be broken, is it https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administrat...
[09:38:11] mmm the https://logstash.wikimedia.org/app/dashboards#/view/ORES dashboard is completely broken
[09:46:15] \o
[09:46:55] elukey: where do we define the URLs revscoring models are reachable under? I am failing to construct a curl request for enwiki-editquality-damaging
[09:48:35] o/
[09:48:45] it is all defined by knative and istio IIRC
[09:48:49] what link are you using?
[09:49:03] So this works fine: curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict" -X POST -d @input.json -i -H "Host: enwiki-articlequality.revscoring-articlequality.wikimedia.org" --http1.1
[09:49:18] But just editing that to editquality I can't get it to work
[09:50:59] for editquality there are 3 segments, goodfaith, damaging and reverted
[09:51:02] like
[09:51:03] curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict" -d @input.json -i -H "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" --http1.1
[09:51:12] inference-staging sorry
[09:51:28] Ah, I had added -damaging in places where I shouldn't have
[09:51:45] interesting, the command above hangs
[09:52:01] yep
[09:52:32] ah right I pointed it to eqiad
[09:52:48] with codfw works :D
[09:53:53] same for damaging
[09:53:57] so I suspect that the ORES dashboard broke when we migrated to Buster
[09:54:13] last logs that I can see are around 2022-5-12
[09:54:33] 10Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (10elukey)
[09:55:10] That seems likely. I'll finish up testing the models I deployed and then help take a look at the dashboard
[09:57:09] elukey@ores1001:~$ cat /etc/uwsgi/apps-available/ores.ini | grep logstash
[09:57:09] log-encoder=json:logstash {"@timestamp":"${strftime:%%Y-%%m-%%dT%%H:%%M:%%S}","type":"ores","logger_name":"uwsgi","host":"%h","level":"INFO","message":"${msg}"}
[09:57:12] log-route=logstash .*
[09:57:14] logger=logstash socket:localhost:11514
[09:57:23] this seems to be config, but I don't see anything listening on 11514
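A sketch of how one could check for a listener on that port (flags are illustrative, assuming iproute2's ss or net-tools' netstat is installed); the rsyslog listener turns out to be UDP, so TCP-only checks will miss it:

    # UDP sockets (-u), listening (-l), numeric ports (-n), owning process (-p)
    sudo ss -ulnp | grep 11514
    # or, with netstat if installed
    sudo netstat -ulnp | grep 11514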
[09:57:39] Did maybe a default port change?
[10:05:20] in theory no
[10:05:26] I see in profile::ores::web
[10:05:26] # rsyslog forwards json messages sent to localhost along to logstash via kafka
[10:05:29] class { '::profile::rsyslog::udp_json_logback_compat':
[10:05:32] port => $logstash_port,
[10:05:34] }
[10:05:55] logstash_json_lines_port: 11514 is defined in common hiera
[10:06:16] but I don't see the listener via netstat, so something doesn't work
[10:16:09] we do have /etc/rsyslog.d/50-udp-json-logback-compat.conf on ores nodes
[10:18:26] on 1005:
[10:18:29] Jul 06 00:00:03 ores1005 rsyslogd[16692]: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down [v8.1901.0 try https://www.rsyslog.com/e/2422 ]
[10:19:02] this happens from time to time
[10:19:05] udp 0 0 127.0.0.1:11514 0.0.0.0:* 12785/rsyslogd
[10:19:19] of course I was checking tcp ports, not udp ones
[10:19:25] so we do have the listener
[10:20:35] of course I was checking tcp ports, not udp ones :D
[10:21:07] I also see established TCP connections from rsyslogd to kafka-logging machines
[10:22:35] the udp_json_logback_compat_topic topic is empty though
[10:24:44] Since we probably don't do outgoing netfilter rules, it can't really be that, either
[10:25:44] Would the exporter/listener run as a separate process? Since I only see rsyslogd and the prom exporter as actual processes
[10:26:32] it should be part of rsyslogd itself
[10:26:38] I imagine a thread
[10:26:52] yeah, rsyslogd is what's listening on 11514 here
[10:28:49] an strace on rsyslogd on recvmsg calls also only really shows systemd logging stuff (fd 3, the udp listener is on fd's 8 and 9)
[10:29:23] So nothing is sending stuff to that socket
[10:29:32] (or it never arrives)
[10:29:55] sendmsg (sending on a UDP socket) doesn't seem to happen at all
[10:30:43] correction, very sparsely on sockets 13 and 15, which are used to talk to centrallog1001 and 2002
[10:31:07] Who/what would be sending to port 11524?
[10:31:11] 11514*
[10:31:46] uwsgi in theory, I am seeing some traffic via tcpdump but the pcap doesn't show anything carried, all len 0
[10:31:50] maybe it is uwsgi
[10:37:10] need to go now, but I suspect this is an issue with uwsgi on buster
[10:37:20] it would explain what happened to our logging
[10:37:25] alright, I'll keep digging
[10:37:26] * elukey lunch!
[10:37:33] lemme know what you find!
[11:00:45] elukey: current working hypothesis: On newer uwsgi, using `logger=logstash socket:localhost:11514` will not work because for some godawful reason it's interpreted as a socket/filesystem name. I did a quick test on 1005, replacing it with 127.0.0.1, and immediately saw datagrams to 11514. But I am not sure that was enough/it. Also Puppet quickly corrected my "mistake"
[11:02:21] hrm, no, that wasn't it. After puppet reverted things, I still see msgs (all size 0, as you mentioned)
[11:09:08] I added a debug logger that is like the JSON socket logger, but logging to a local file. It works fine.
[11:11:07] tshark also shows actual, good datagrams on lo, so the size=0 thing is probably a red herring
[11:14:15] 10Machine-Learning-Team, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway), 10Platform Team Workboards (Platform Engineering Reliability): Proposal: add a per-service rate limit setting to API Gateway - https://phabricator.wikimedia.org/T295956 (10hnowlan) This has been implemented and deployed...
[11:23:16] Ok, some more rummaging insights: JSON is emitted by uwsgi, sent to rsyslogd, which uses recvmmsg (sic) to see the JSON.
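A sketch of the kind of strace invocation behind the observations above; the PID lookup and flags are assumptions, not the exact command that was run:

    # attach to rsyslogd and watch only the UDP-relevant syscalls
    sudo strace -f -tt -p "$(pgrep -o rsyslogd)" -e trace=recvmsg,recvmmsg,sendmsg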
[11:26:23] https://phabricator.wikimedia.org/P30941 See here. The EAGAIN errors are normal, it's just rsyslogd making sure it got all data
[11:28:38] I _also_ see traffic to the kafka servers, but it's TLS, so no idea about content
[11:32:23] Ok, whatever I did repaired 1005, I think. The dashboard shows its data
[11:35:50] I think this did the trick, in ores.ini:
[11:35:55] -logger=logstash socket:localhost:11514
[11:35:57] +logger=logstash socket:127.0.0.1:11514
[11:36:30] Either uwsgi interprets the name as a filesystem path somewhere, or its DNS lookup (or hosts or whatever getent) fails.
[12:35:09] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10Ottomata) Yup! And just fixed that link, thank you!
[13:08:57] klausman: wow nice debugging! Do you want to send the code review?
[13:12:19] I can. Thing is, changing ores.ini is of course papering over the problem, so to speak
[13:13:33] klausman: sure, but in the meantime we have logs :D We can probably open a more generic task and see if other uwsgi instances have the same issue
[13:13:45] Yarp.
[13:16:00] review sent
[13:16:25] I _think_ the other instance of `localhost` is fine.
[13:17:47] I am running pcc now
[13:22:23] klausman: one quick thing - uwsgi will be restarted in theory, so let's not run puppet on all nodes
[13:22:38] maybe in little batches if you use cumin
[13:22:45] ack
[13:23:16] eqiad.mediawiki.revision-score-test --> new topic in kafka main :)
[13:24:22] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10elukey) The curl command above works now! I can see the `eqiad.mediawiki.revision-score-test` topic in Kafka main too.
[13:24:36] doing puppet agent runs on ores1* in batches of 3
[13:28:10] Dammit, it didn't work
[13:28:17] Still says `localhost` in /etc/uwsgi/apps-enabled/ores.ini
[13:28:57] Should I have changed service::configuration::logstash_host: instead?
[13:29:26] so it changed /etc/ores/99-main.yaml, I thought it was the right file
[13:29:42] My bad
[13:29:51] nah it is fine :)
[13:30:13] would the above var then be the right one? Eh, I'll PCC it
[13:31:56] it is likely the right one
[13:33:18] Yes, this looks better
[13:37:38] alright sent for review :)
[13:37:53] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Test async preprocess on kserve - https://phabricator.wikimedia.org/T309623 (10achou) @kevinbazira As I showed in the meeting, there will be an `AsyncSession` class for the async use case, so it won't disrupt other mwapi library users w...
[13:41:24] merged, running agent
[13:47:17] 10Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (10elukey)
[13:48:31] Weird. That did not fix it?
[13:49:19] what do you mean?
[13:49:38] Still no messages being read by rsyslog
[13:49:46] How long does a restart take, typically?
[13:50:17] on ores1001 I see non-zero udp messages
[13:50:44] Still seeing Len=0 messages on 1005
[13:51:31] I am using `sudo tcpdump -i lo udp port 11514 -vv -X` and I see non-zero msg on 1005
[13:51:42] wait, I see Len=0 on ores1001. How do you trace packets?
[13:53:28] Ok, I _occasionally_ see non-0 lengths
[13:54:17] still nothing on the logstash board though
[13:54:27] aiko: o/
[13:54:37] Weird.
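One way to rule uwsgi in or out at this point would be to hand-feed rsyslog a datagram and watch whether it reaches the Kafka topic and the dashboard; a sketch, assuming the OpenBSD netcat is installed and that rsyslog forwards any JSON line it receives on that port:

    # emit one JSON log line shaped like the uwsgi encoder output shown earlier
    echo '{"@timestamp":"2022-07-07T13:55:00","type":"ores","logger_name":"uwsgi","host":"ores1005","level":"INFO","message":"manual logstash test"}' \
      | nc -u -w1 127.0.0.1 11514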
[13:54:49] aiko: I'd need to test https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/808247, do you have suggestions?
[13:55:23] I am also thinking long term on the ml-sandbox, we may need an endpoint (even the simple python http server) to test this
[13:56:15] elukey: I have a horrible suspicion
[13:57:24] elukey: so for debugging, I added an additional logger to ores on 1005. Same format as the json-to-11514 logger, but logging to a local file (/srv/log/ores/json-debug.log)
[13:57:56] I just added that in, and boom, traffic to 11514 is non-0 Len
[13:59:45] Reverted it, back to 0
[13:59:58] This is some grade-A BS
[14:00:21] You can even see the short uptick in traffic on logstash
[14:09:38] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10Ottomata) Nice!
[14:10:59] klausman: I see traffic from 1005 only in https://logstash.wikimedia.org/app/dashboards#/view/ORES, weird
[14:11:20] See above: enabling a local logger with the same config makes the UDP logger work
[14:11:26] This reeks of a uwsgi bug
[14:11:47] yes yes but it may also be that we migrated to a new version that requires a different config
[14:12:04] Not as far as I can tell
[14:12:08] maybe the old one leads to a bug, or we are using a version in buster that is bugged and upstream already fixed it
[14:12:30] The config for loggers seems to still be the same as the way we use it.
[14:13:14] And arguably, adding another logger should not change behavior of an existing one. Plus, those weird empty UDP datagrams. My money is on "this is a bug". Currently browsing uwsgi on GH to find something
[14:14:31] so on stretch we had 2.0.14+20161117-3+deb9u2+wmf1 and on buster we are using 2.0.18, that is the debian upstream version
[14:14:48] I recall that I added a patch on stretch, and we rebuilt the package
[14:15:01] that was supposed to be added on the new version
[14:15:02] Latest upstream is 2.0.20
[14:15:04] lemme check
[14:16:17] it was https://phabricator.wikimedia.org/T212697 but not really related in theory
[14:19:01] * elukey little break
[14:28:45] Filed https://phabricator.wikimedia.org/T312550 for uwsgi
[14:43:54] Morning all!
[14:44:24] \o
[14:44:30] My phone literally ran out of batteries so my alarm didn't go off. What a weird bug.
[14:44:44] chris, if you want to read about a bizarre bug re: ORES, I recommend https://phabricator.wikimedia.org/T312550
[14:45:59] lol, sigh
[14:46:28] At least it's not _in_ ORES. As far as I can tell.
[14:47:31] This is like the double-slit experiment.
[14:47:47] elukey: one "fix" we could use for this new situation is to add a local logger like I did, and have it write to /dev/null
[14:48:02] But I dunno yet how messy that will be to get into puppet.
[14:48:57] (┛ಠ_ಠ)┛彡┻━┻
[14:49:04] Indeed
[14:51:56] another possibility is to file a github issue to upstream, explaining the problem and asking for advice
[15:08:35] filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/812010/ to allow traffic from the lift wing mesh to eventgate main
[15:09:04] once the code for kserve is reviewed and deployed we should be able to test the generation of new revscoring events directly from Lift Wing
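A rough sketch of the /dev/null workaround klausman floated above (14:47), as extra stanzas in ores.ini; the logger name is made up and this is not the change that was actually deployed:

    ; hypothetical second logger whose output is discarded; merely defining it
    ; seemed to unstick the UDP logstash logger on buster
    logger = devnull file:/dev/null
    log-route = devnull .*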
[15:18:28] klausman: while we debug the uwsgi issue, could you please file a github issue to uwsgi upstream? Basically with nothing more than the description that you added in the task
[15:18:35] so we can proceed in parallel
[15:18:42] ack, will do
[15:18:47] thanks :)
[15:29:45] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Test async preprocess on kserve - https://phabricator.wikimedia.org/T309623 (10elukey) >>! In T309623#8054613, @kevinbazira wrote: > @achou thank you for digging into the async-mediawiki library. Following yesterday's chat in the meetin...
[15:29:51] kevinbazira: --^
[15:30:05] (tried to add some details about what me and Aiko have in mind)
[15:32:28] chrisalbon: is it ok to drop what is listed in https://phabricator.wikimedia.org/T307389 ?
[15:33:35] ah interesting, they power https://labels.wmflabs.org/
[15:34:11] if we want to keep it then we'll need to tell cloud services what to do
[15:35:08] (03CR) 10Elukey: "The code is yet to be tested, but I think that it is ready for a first pass of comments :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/808247 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[15:35:12] yeah we need to keep labels.wmflabs.org working
[15:35:28] Not forever, but until we have something better, which we don't right now
[15:35:48] I'll add that comment to phab
[15:35:49] sorry
[15:36:50] ok so IIUC we'd need to put some work into creating new VMs elsewhere, so we unblock cloud services
[15:38:38] I'll read the comments on the task :)
[15:38:47] wrapping up for today folks, have a nice rest of the day :)
[15:38:51] 10Machine-Learning-Team, 10Data-Services, 10Wikilabels, 10Cloud-VPS (Debian Stretch Deprecation), 10cloud-services-team (Kanban): Upgrade wikilabels databases to buster/bullseye - https://phabricator.wikimedia.org/T307389 (10calbon) These power Wikilabels, which is how training data is collected to creat...
[15:39:38] elukey I'll create a phab task in our board for moving to a VM. Have a great night!
[15:41:45] note that in addition to the database VMs in clouddb-services I've been nagging you about, there's also a stretch vm in the wikilabels project that needs to be upgraded: https://os-deprecation.toolforge.org/stretch/wikilabels.html
[15:43:04] lol, why does it have so many. Sigh. Taavi can you add that note to the ticket so whoever gets assigned it on my team can handle that as well.
[15:44:34] Most of these DBs must be orphans and not used by Wikilabels
[15:44:50] I don't even understand why Wikilabels is using two Postgres instances
[15:45:00] It almost certainly should only be using 1.
[15:45:27] This is all to say we might need to do some clean up
[15:45:32] one is a replica of the other
[15:45:43] added a comment to T312564
[15:47:02] We should move all the old data to DE's data lake and then just reinstall Wikilabels on a VM with a fresh DB
[15:47:04] Thanks Taavi
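A quick, hypothetical check to tell the primary from the replica among the two wikilabels Postgres instances before deciding what to drop (hosts and access method are assumptions):

    # returns 'f' on the primary, 't' on a streaming replica
    sudo -u postgres psql -tAc "SELECT pg_is_in_recovery();"
    # on the primary: list replicas currently streaming from it
    sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"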