[10:03:03] lunch
[11:10:35] lunch
[12:36:57] o/
[13:28:02] WDQS looks good so far! I guess the fix did work
[13:28:47] yes! seems like the rule was the right one
[13:42:31] Heading out for my son's graduation. I **should** be back for Weds mtg, but might be a little late.
[15:45:40] back
[15:45:50] Looks like WDQS is blowing up again in DFW?
[15:46:49] nm, my scrollback was too far up
[17:15:32] o/ I’m almost done with the redirect handling in EventBus, but there’s one CI check that keeps failing: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/913030 (https://integration.wikimedia.org/ci/job/mwext-php74-phan-docker/46627/console) and I have no clue why that’s an issue.
[17:15:53] includes/HookHandlers/MediaWiki/PageChangeHooks.php:473 PhanTypeSuspiciousStringExpression Suspicious type false of a variable or expression $wikiPage->getWikiId() used to build a string. (Expected type to be able to cast to a string)
[17:16:52] But there is no boolean at line 473: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/913030/25/includes/HookHandlers/MediaWiki/PageChangeHooks.php#473
[17:30:22] dinner
[17:39:57] lunch, back in ~1h
[17:43:54] pfischer: maybe because of the === false check? it thinks that it might return false, and you are using the value of it to concat to a string?
[18:06:03] ottomata: Well, it only uses === inside the ternary condition.
[18:27:16] pfischer: seems like a false positive, but not sure you need to bother with logging the wiki id, it should always be false (local); relatedly, perhaps you should prefer logging vars rather than concatenation: warn("failed to resolve redirect target for source page {title}", [ "title" => $wikiPage->getDBkey(), "exception" => $e ])
[18:34:32] probably phan does not know that getWikiId() is idempotent and it thinks that it could return a string on the first call and then false on the second
[18:45:00] back
[19:04:33] inflatador: seems like the streaming-updater running in codfw is dead; quickly looking, it seems the jobmanager pods need a forced restart
[19:08:10] dcausse :eyes
[19:10:13] seeing some kubernetes alerts, but for eqiad
[19:19:09] trying "kubectl rollout restart deployment flink-session-cluster-main" but it does not seem like I have the rights to do it
[19:21:41] dcausse I can try it if you want. Was looking thru logstash to find a checkpoint ID but I'm not getting results from CODFW
[19:22:02] If I drop the "message" match it works, but I can't match anything with message:"Completed checkpoint"
[19:22:54] hopefully we don't need to recover manually from a checkpoint, just a restart might work
[19:23:14] Ah OK, I didn't think of that
[19:25:07] dcausse just ran the above cmd as admin and it worked...watching logs now
[19:26:57] looks healthy...LMK if you notice anything amiss
[19:27:59] yes, seems like it did the trick, thanks!
[19:30:43] for logs you need to select the ecs-* index now and use this query: https://logstash.wikimedia.org/goto/62322d75b61f223052c60f96560c929b
[19:30:48] we have to update the doc
[19:31:26] dcausse ACK, will update docs now
[19:32:04] nm, looks like you got it!
[19:33:38] or not. OK, I got it
[19:42:25] how did you know that CODFW was in trouble? I guess the `RdfStreamingUpdaterFlinkJobUnstable` alert?
[19:43:06] yes, and then looking at the graph it was all flat since ~2pm CET
[19:44:08] but it fired for other reasons, I think we lack some alerting
[19:46:20] RdfStreamingUpdaterFlinkJobUnstable fired just recently, so something's not right in how we detect failures
[19:53:09] hm, flink_jobmanager_numRunningJobs{kubernetes_namespace="rdf-streaming-updater"} remained == 2 but flink_jobmanager_job_uptime remained equal to 0, so RdfStreamingUpdaterFlinkJobUnstable should have been triggered earlier...
[19:56:47] scratch that, RdfStreamingUpdaterFlinkJobUnstable did fire at the right time, I simply missed it
[19:58:22] well, I'm seeing it in IRC and the logstash alert history but not in emails
[19:58:32] Funny, I see it in email but not in IRC
[19:58:54] at 14:30 UTC?
[19:59:12] nm, it's in email too
[19:59:30] no, the alert I see is from ~90m ago
[19:59:36] I see the email at 18:30 but nothing around 14:30
[20:01:18] oh wait, I see it now, gmail being too smart and collapsing similar emails
[20:01:23] I did get an email alert from `RdfStreamingUpdaterFlinkJobUnstable` at 14:44 UTC
[20:01:58] yes, me too, my bad, it was hidden by the earliest alert
[20:02:50] I was out of office then and scrolled past it, was looking at yesterday's alerts ;(
[20:06:03] hm, wonder if I can disable threading in gmail but just for a folder
[20:06:52] anyways, going offline, but we need to get this flink 1.16 out to avoid this problem again
[20:07:56] Can't do pairing at our normal time tomorrow, but could do later or Friday
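
For context on the Phan failure discussed at 17:15–18:34, here is a minimal PHP sketch of the pattern Phan flags and of the structured-logging alternative suggested in the chat. This is not the actual code from change 913030: the function names, the PSR-3 logger wiring, and the surrounding code are assumptions for illustration only.

```php
<?php
// Illustrative sketch, not the EventBus patch itself. It assumes a PSR-3
// logger and a MediaWiki WikiPage, and reconstructs the shape of the warning
// Phan complains about at PageChangeHooks.php:473, plus two ways around the
// false positive that come up in the discussion above.

use Psr\Log\LoggerInterface;

/** The kind of expression Phan flags. */
function logWithConcatenation( LoggerInterface $logger, \WikiPage $wikiPage ): void {
	// getWikiId() is declared string|false (false meaning the local wiki).
	// Phan does not know the method is idempotent, so despite the === false
	// guard it assumes the second call could still return false and end up
	// concatenated into the string.
	$logger->warning(
		'failed to resolve redirect target for source page ' .
		( $wikiPage->getWikiId() === false ? 'local' : $wikiPage->getWikiId() ) .
		':' . $wikiPage->getDBkey()
	);
}

/** Structured-logging variant suggested in the discussion. */
function logWithContext( LoggerInterface $logger, \WikiPage $wikiPage, \Throwable $e ): void {
	// Calling getWikiId() only once would already let Phan narrow the type:
	//   $wikiId = $wikiPage->getWikiId();
	//   $wikiIdText = $wikiId === false ? 'local' : $wikiId;
	// But since the id is always false (local) here, the simpler route is to
	// drop it and pass variables as PSR-3 context instead of concatenating.
	$logger->warning(
		'failed to resolve redirect target for source page {title}',
		[
			'title' => $wikiPage->getDBkey(),
			'exception' => $e,
		]
	);
}
```

In principle either shape avoids the suspicious-string warning, though only re-running the mwext-php74-phan-docker job would confirm it.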