[06:46:14] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Technical-Debt: Migrate usage of Database::select to SelectQueryBuilder in ORES - https://phabricator.wikimedia.org/T312454 (10Ladsgroup)
[08:51:53] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10elukey) Ack! The link for full service restart may be broken, is it https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administrat...
[09:38:11] mmm the https://logstash.wikimedia.org/app/dashboards#/view/ORES dashboard is completely broken
[09:46:15] \o
[09:46:55] elukey: where do we define the URLs revscoring models are reachable under? I am failing to construct a curl request for enwiki-editquality-damaging
[09:48:35] o/
[09:48:45] it is all defined by knative and istio IIRC
[09:48:49] what link are you using?
[09:49:03] So this works fine: curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict" -X POST -d @input.json -i -H "Host: enwiki-articlequality.revscoring-articlequality.wikimedia.org" --http1.1
[09:49:18] But just editing that to editquality I can't get it to work
[09:50:59] for editquality there are 3 segments, goodfaith, damaging and reverted
[09:51:02] like
[09:51:03] curl "https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-goodfaith:predict" -d @input.json -i -H "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" --http1.1
[09:51:12] inference-staging sorry
[09:51:28] Ah, I had added -damaging in places where I shouldn't have
[09:51:45] interesting, the command above hangs
[09:52:01] yep
[09:52:32] ah right I pointed it to eqiad
[09:52:48] with codfw works :D
[09:53:53] same for damaging
[09:53:57] so I suspect that the ORES dashboard broke when we migrated to Buster
[09:54:13] last logs that I can see are around 2022-5-12
[09:54:33] 10Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (10elukey)
[09:55:10] That seems likely. I'll finish up testing the models I deployed and then help take a look at the dashboard
[09:57:09] elukey@ores1001:~$ cat /etc/uwsgi/apps-available/ores.ini | grep logstash
[09:57:09] log-encoder=json:logstash {"@timestamp":"${strftime:%%Y-%%m-%%dT%%H:%%M:%%S}","type":"ores","logger_name":"uwsgi","host":"%h","level":"INFO","message":"${msg}"}
[09:57:12] log-route=logstash .*
[09:57:14] logger=logstash socket:localhost:11514
[09:57:23] this seems to be config, but I don't see anything listening on 11514
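A sketch of how one could check for a listener on that port (flags are illustrative, assuming iproute2's ss or net-tools' netstat is installed); the rsyslog listener turns out to be UDP, so TCP-only checks will miss it:

    # UDP sockets (-u), listening (-l), numeric ports (-n), owning process (-p)
    sudo ss -ulnp | grep 11514
    # or, with netstat if installed
    sudo netstat -ulnp | grep 11514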
[09:57:39] Did maybe a default port change?
[10:05:20] in theory no
[10:05:26] I see in profile::ores::web
[10:05:26] # rsyslog forwards json messages sent to localhost along to logstash via kafka
[10:05:29] class { '::profile::rsyslog::udp_json_logback_compat':
[10:05:32] port => $logstash_port,
[10:05:34] }
[10:05:55] logstash_json_lines_port: 11514 is defined in common hiera
[10:06:16] but I don't see the listener via netstat, so something doesn't work
[10:16:09] we do have /etc/rsyslog.d/50-udp-json-logback-compat.conf on ores nodes
[10:18:26] on 1005:
[10:18:29] Jul 06 00:00:03 ores1005 rsyslogd[16692]: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down [v8.1901.0 try https://www.rsyslog.com/e/2422 ]
[10:19:02] this happens from time to time
[10:19:05] udp 0 0 127.0.0.1:11514 0.0.0.0:* 12785/rsyslogd
[10:19:19] of course I was checking tcp ports, not udp ones
[10:19:25] so we do have the listener
[10:20:35] of course I was checking tcp ports, not udp ones :D
[10:21:07] I also see established TCP connections from rsyslogd to kafka-logging machines
[10:22:35] the udp_json_logback_compat_topic topic is empty though
[10:24:44] Since we probably don't do outgoing netfilter rules, it can't really be that, either
[10:25:44] Would the exporter/listener run as a separate process? Since I only see rsyslogd and the prom exporter as actual processes
[10:26:32] it should be part of rsyslogd itself
[10:26:38] I imagine a thread
[10:26:52] yeah, rsyslogd is what's listening on 11514 here
[10:28:49] an strace on rsyslogd on recvmsg calls also only really shows systemd logging stuff (fd 3, the udp listener is on fd's 8 and 9)
[10:29:23] So nothing is sending stuff to that socket
[10:29:32] (or it never arrives)
[10:29:55] sendmsg (sending on a UDP socket) doesn't seem to happen at all
[10:30:43] correction, very sparsely on sockets 13 and 15, which are used to talk to centrallog1001 and 2002
[10:31:07] Who/what would be sending to port 11524?
[10:31:11] 11514*
[10:31:46] uwsgi in theory, I am seeing some traffic via tcpdump but the pcap doesn't show anything carried, all len 0
[10:31:50] maybe it is uwsgi
[10:37:10] need to go now, but I suspect this is an issue with uwsgi on buster
[10:37:20] it would explain what happened to our logging
[10:37:25] alright, I'll keep digging
[10:37:26] * elukey lunch!
[10:37:33] lemme know what you find!
[11:00:45] elukey: current working hypothesis: On newer uwsgi, using `logger=logstash socket:localhost:11514` will not work because for some godawful reason it's interpreted as a socket/filesystem name. I did a quick test on 1005, replacing it with 127.0.0.1, and immediately saw datagrams to 11514. But I am not sure that was enough/it. Also Puppet quickly corrected my "mistake"
[11:02:21] hrm, no, that wasn't it. After puppet reverted things, I still see msgs (all size 0, as you mentioned)
[11:09:08] I added a debug logger that is like the JSON socket logger, but logging to a local file. It works fine.
[11:11:07] tshark also shows actual, good datagrams on lo, so the size=0 thing is probably a red herring
[11:14:15] 10Machine-Learning-Team, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway), 10Platform Team Workboards (Platform Engineering Reliability): Proposal: add a per-service rate limit setting to API Gateway - https://phabricator.wikimedia.org/T295956 (10hnowlan) This has been implemented and deployed...
[11:23:16] Ok, some more rummaging insights: JSON is emitted by uwsgi, sent to rsyslogd, which uses recvmmsg (sic) to see the JSON.
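A sketch of the kind of strace invocation behind the observations above; the PID lookup and flags are assumptions, not the exact command that was run:

    # attach to rsyslogd and watch only the UDP-relevant syscalls
    sudo strace -f -tt -p "$(pgrep -o rsyslogd)" -e trace=recvmsg,recvmmsg,sendmsg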
[11:26:23] https://phabricator.wikimedia.org/P30941 See here. The EAGAIN errors are normal, it's just rsyslogd making sure it got all data
[11:28:38] I _also_ see traffic to the kafka servers, but it's TLS, so no idea about content
[11:32:23] Ok, whatever I did repaired 1005, I think. The dashboard shows its data
[11:35:50] I think this did the trick, in ores.ini:
[11:35:55] -logger=logstash socket:localhost:11514
[11:35:57] +logger=logstash socket:127.0.0.1:11514
[11:36:30] Either uwsgi interprets the name as a filesystem path somewhere, or its DNS lookup (or hosts or whatever getent) fails.
[12:35:09] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10Ottomata) Yup! And just fixed that link, thank you!
[13:08:57] klausman: wow nice debugging! Do you want to send the code review?
[13:12:19] I can. Thing is, changing ores.ini is of course papering over the problem, so to speak
[13:13:33] klausman: sure, but in the meantime we have logs :D We can probably open a more generic task and see if other uwsgi instances have the same issue
[13:13:45] Yarp.
[13:16:00] review sent
[13:16:25] I _think_ the other instance of `localhost` is fine.
[13:17:47] I am running pcc now
[13:22:23] klausman: one quick thing - uwsgi will be restarted in theory, so let's not run puppet on all nodes
[13:22:38] maybe in little batches if you use cumin
[13:22:45] ack
[13:23:16] eqiad.mediawiki.revision-score-test --> new topic in kafka main :)
[13:24:22] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10elukey) The curl command above works now! I can see the `eqiad.mediawiki.revision-score-test` topic in Kafka main too.
[13:24:36] doing puppet agent runs on ores1* in batches of 3
[13:28:10] Dammit, it didn't work
[13:28:17] Still says `localhost` in /etc/uwsgi/apps-enabled/ores.ini
[13:28:57] Should I have changed service::configuration::logstash_host: instead?
[13:29:26] so it changed /etc/ores/99-main.yaml, I thought it was the right file
[13:29:42] My bad
[13:29:51] nah it is fine :)
[13:30:13] would the above var then be the right one? Eh, I'll PCC it
[13:31:56] it is likely the right one
[13:33:18] Yes, this looks better
[13:37:38] alright sent for review :)
[13:37:53] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Test async preprocess on kserve - https://phabricator.wikimedia.org/T309623 (10achou) @kevinbazira As I showed in the meeting, there will be an `AsyncSession` class for the async use case, so it won't disrupt other mwapi library users w...
[13:41:24] merged, running agent
[13:47:17] 10Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (10elukey)
[13:48:31] Weird. That did not fix it?
[13:49:19] what do you mean?
[13:49:38] Still no messages being read by rsyslog
[13:49:46] How long does a restart take, typically?
[13:50:17] on ores1001 I see non-zero udp messages
[13:50:44] Still seeing Len=0 messages on 1005
[13:51:31] I am using `sudo tcpdump -i lo udp port 11514 -vv -X` and I see non-zero msg on 1005
[13:51:42] wait, I see Len=0 on ores1001. How do you trace packets?
[13:53:28] Ok, I _occasionally_ see non-0 lengths
[13:54:17] still nothing on the logstash board though
[13:54:27] aiko: o/
[13:54:37] Weird.
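One way to rule uwsgi in or out at this point would be to hand-feed rsyslog a datagram and watch whether it reaches the Kafka topic and the dashboard; a sketch, assuming the OpenBSD netcat is installed and that rsyslog forwards any JSON line it receives on that port:

    # emit one JSON log line shaped like the uwsgi encoder output shown earlier
    echo '{"@timestamp":"2022-07-07T13:55:00","type":"ores","logger_name":"uwsgi","host":"ores1005","level":"INFO","message":"manual logstash test"}' \
      | nc -u -w1 127.0.0.1 11514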
[13:54:49] aiko: I'd need to test https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/808247, do you have suggestions?
[13:55:23] I am also thinking long term on the ml-sandbox, we may need an endpoint (even the simple python http server) to test this
[13:56:15] elukey: I have a horrible suspicion
[13:57:24] elukey: so for debugging, I added an additional logger to ores on 1005. Same format as the json-to-11514 logger, but logging to a local file (/srv/log/ores/json-debug.log)
[13:57:56] I just added that in, and boom, traffic to 11514 is non-0 Len
[13:59:45] Reverted it, back to 0
[13:59:58] This is some grade-A BS
[14:00:21] You can even see the short uptick in traffic on logstash
[14:09:38] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10Ottomata) Nice!
[14:10:59] klausman: I see traffic from 1005 only in https://logstash.wikimedia.org/app/dashboards#/view/ORES, weird
[14:11:20] See above: enabling a local logger with the same config makes the UDP logger work
[14:11:26] This reeks of a uwsgi bug
[14:11:47] yes yes but it may also be that we migrated to a new version that requires a different config
[14:12:04] Not as far as I can tell
[14:12:08] maybe the old one leads to a bug, or we are using a version in buster that is bugged and upstream already fixed it
[14:12:30] The config for loggers seems to still be the same as the way we use it.
[14:13:14] And arguably, adding another logger should not change behavior of an existing one. Plus, those weird empty UDP datagrams. My money is on "this is a bug". Currently browsing uwsgi on GH to find something
[14:14:31] so on stretch we had 2.0.14+20161117-3+deb9u2+wmf1 and on buster we are using 2.0.18, that is the debian upstream version
[14:14:48] I recall that I added a patch on stretch, and we rebuilt the package
[14:15:01] that was supposed to be added on the new version
[14:15:02] Latest upstream is 2.0.20
[14:15:04] lemme check
[14:16:17] it was https://phabricator.wikimedia.org/T212697 but not really related in theory
[14:19:01] * elukey little break
[14:28:45] Filed https://phabricator.wikimedia.org/T312550 for uwsgi
[14:43:54] Morning all!
[14:44:24] \o
[14:44:30] My phone literally ran out of batteries so my alarm didn't go off. What a weird bug.
[14:44:44] chris, if you want to read about a bizarre bug re: ORES, I recommend https://phabricator.wikimedia.org/T312550
[14:45:59] lol, sigh
[14:46:28] At least it's not _in_ ORES. As far as I can tell.
[14:47:31] This is like the double-slit experiment.
[14:47:47] elukey: one "fix" we could use for this new situation is to add a local logger like I did, and have it write to /dev/null
[14:48:02] But I dunno yet how messy that will be to get into puppet.
[14:48:57] (┛ಠ_ಠ)┛彡┻━┻
[14:49:04] Indeed
[14:51:56] another possibility is to file a github issue to upstream, explaining the problem and asking for advice
[15:08:35] filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/812010/ to allow traffic from the lift wing mesh to eventgate main
[15:09:04] once the code for kserve is reviewed and deployed we should be able to test the generation of new revscoring events directly from Lift Wing
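A rough sketch of the /dev/null workaround klausman floated above (14:47), as extra stanzas in ores.ini; the logger name is made up and this is not the change that was actually deployed:

    ; hypothetical second logger whose output is discarded; merely defining it
    ; seemed to unstick the UDP logstash logger on buster
    logger = devnull file:/dev/null
    log-route = devnull .*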
[15:18:28] klausman: while we debug the uwsgi issue, could you please file a github issue to uwsgi upstream? Basically with nothing more than the description that you added in the task
[15:18:35] so we can proceed in parallel
[15:18:42] ack, will do
[15:18:47] thanks :)
[15:29:45] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Test async preprocess on kserve - https://phabricator.wikimedia.org/T309623 (10elukey) >>! In T309623#8054613, @kevinbazira wrote: > @achou thank you for digging into the async-mediawiki library. Following yesterday's chat in the meetin...
[15:29:51] kevinbazira: --^
[15:30:05] (tried to add some details about what me and Aiko have in mind)
[15:32:28] chrisalbon: is it ok to drop what is listed in https://phabricator.wikimedia.org/T307389 ?
[15:33:35] ah interesting, they power https://labels.wmflabs.org/
[15:34:11] if we want to keep it then we'll need to tell cloud services what to do
[15:35:08] (03CR) 10Elukey: "The code is yet to be tested, but I think that it is ready for a first pass of comments :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/808247 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[15:35:12] yeah we need to keep labels.wmflabs.org working
[15:35:28] Not forever, but until we have something better, which we don't right now
[15:35:48] I'll add that comment to phab
[15:35:49] sorry
[15:36:50] ok so IIUC we'd need to put some work into creating new VMs elsewhere, so we unblock cloud services
[15:38:38] I'll read the comments on the task :)
[15:38:47] wrapping up for today folks, have a nice rest of the day :)
[15:38:51] 10Machine-Learning-Team, 10Data-Services, 10Wikilabels, 10Cloud-VPS (Debian Stretch Deprecation), 10cloud-services-team (Kanban): Upgrade wikilabels databases to buster/bullseye - https://phabricator.wikimedia.org/T307389 (10calbon) These power Wikilabels, which is how training data is collected to creat...
[15:39:38] elukey I'll create a phab task in our board for moving to a VM. Have a great night!
[15:41:45] note that in addition to the database VMs in clouddb-services I've been nagging you about, there's also a stretch vm in the wikilabels project that needs to be upgraded: https://os-deprecation.toolforge.org/stretch/wikilabels.html
[15:43:04] lol, why does it have so many. Sigh. Taavi can you add that note to the ticket so whoever gets assigned it on my team can handle that as well.
[15:44:34] Most of these DBs must be orphans and not used by Wikilabels
[15:44:50] I don't even understand why Wikilabels is using two Postgres instances
[15:45:00] It almost certainly should only be using 1.
[15:45:27] This is all to say we might need to do some clean up
[15:45:32] one is a replica of the other
[15:45:43] added a comment to T312564
[15:47:02] We should move all the old data to DE's data lake and then just reinstall Wikilabels on a VM with a fresh DB
[15:47:04] Thanks Taavi
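A quick, hypothetical check to tell the primary from the replica among the two wikilabels Postgres instances before deciding what to drop (hosts and access method are assumptions):

    # returns 'f' on the primary, 't' on a streaming replica
    sudo -u postgres psql -tAc "SELECT pg_is_in_recovery();"
    # on the primary: list replicas currently streaming from it
    sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"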