[07:48:08] my bad, Special and Advanced are the same
[11:00:31] lunch
[11:32:44] lunch 2
[13:37:54] dcausse: would you like to continue by 3:30?
[13:44:47] ejoseph: sure
[14:37:23] ejoseph: are you around?
[14:37:36] Yes
[14:37:44] Same link?
[14:39:02] sure
[15:52:35] \o
[15:52:45] ejoseph: I moved our 1:1 tomorrow, I'll be at the doctor with Oscar. Let me know if the new time doesn't work for you
[15:57:51] o/
[16:03:41] fyi, Slack is still down for me (and others)
[16:08:19] gehel: it's ok
[16:08:30] thanks!
[17:46:41] back on Slack again, but having some not so fun stiff neck cramps, so I'm going to take a break for a bit
[17:47:45] gehel: I'm having trouble with flink@codfw and I'm not yet clear why it's failing to deploy a new version of the app; could you switch wdqs traffic to eqiad?
[17:48:29] ryankemper: around?
[17:49:26] dcausse: I'm on it
[17:49:30] gehel: thanks!
[17:49:48] * gehel just needs to remember how we switch traffic
[17:53:09] dcausse: we should be good, I depooled both internal and public on codfw
[17:53:18] thanks, investigating
[17:53:31] scream if you need help, I'll go back to dinner
[17:53:54] dcausse: do I also need to depool wcqs?
[17:54:16] gehel: no, the app is functioning for wcqs
[17:54:26] ack
[17:54:34] that's why I'm a bit puzzled
[17:55:44] not sure I like the exceptions I'm seeing... org.apache.flink.contrib.streaming.state.restore.RocksDBFullRestoreOperation.restoreKVStateData with java.io.EOFException, this does not sound great :/
[17:55:56] :S
[17:56:21] will try a previous savepoint perhaps...
[17:58:25] added a note on how to depool a cluster: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Remediation
[17:59:14] * gehel goes back to dinner
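For the record, depooling one site for wdqs comes down to flipping the conftool discovery records for codfw. The sketch below is only illustrative and assumes the records are named wdqs and wdqs-internal; defer to the runbook linked above for the authoritative names and procedure.

```bash
# From a cluster management host. Record names below are assumptions; check the runbook.
# Send all wdqs traffic to eqiad by depooling the codfw discovery records:
sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false
sudo confctl --object-type discovery select 'dnsdisc=wdqs-internal,name=codfw' set/pooled=false

# Check the resulting state:
sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' get

# Repool once the incident is resolved:
sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=true
sudo confctl --object-type discovery select 'dnsdisc=wdqs-internal,name=codfw' set/pooled=true
```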
[18:09:29] I'm getting alerts on wcqs2003
[18:09:59] Updater is failing
[18:10:07] dcausse: is this related?
[18:10:21] looking
[18:11:58] hm.. might be related to the new version
[18:13:32] dcausse: ping me if you need me
[18:13:40] sure
[18:13:59] * gehel is trying to eat a bit of pie before being pinged
[18:24:38] patch coming for the problem with wcqs
[18:36:56] ebernhardson: if you have a couple of minutes https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/764864
[18:45:12] sure
[18:47:24] seems simple enough, +2'd
[18:48:10] thanks!
[18:59:59] Hi search people! I came to say hi because I got pinged by a wikidata maxlag alert. It looks to me like this is because all the 2xxx queryservice servers have been lagging for the last 1.5 hrs
[19:01:02] gehel: back now
[19:01:14] tarrow: thanks, taking a look now
[19:01:18] cheers!
[19:01:24] tarrow: we have an incident in progress, but we should have routed traffic to eqiad, which is not affected
[19:01:54] codfw is affected by the problem, do you source maxlag from there as well?
[19:02:20] I believe so: https://phabricator.wikimedia.org/T238751
[19:03:15] tarrow: is there a way for you to only source eqiad metrics while the incident is being addressed?
[19:04:22] that is a great question; I have no idea! I'm actually not "on the wikidata team" and it's after Berlin office hours, so let me have a dig around
[19:04:59] thanks!
[19:08:26] dcausse: can we depool codfw?
[19:08:32] oh wait, no, that wouldn't work
[19:08:37] already depooled
[19:08:38] addshore: we did depool
[19:08:47] yeah, tarrow already found the ticket https://phabricator.wikimedia.org/T238751
[19:09:12] I opened https://phabricator.wikimedia.org/T302330 for this
[19:10:06] https://gerrit.wikimedia.org/g/operations/puppet/+/b81506884faa046e4103412f1928ece934a157dc/modules/profile/manifests/mediawiki/maintenance/wikidata.pp#15
[19:10:09] we can remove the codfw bit
[19:19:11] Can any of you deploy https://gerrit.wikimedia.org/r/764875 for us?
[19:20:08] dcausse: ^^
[19:20:25] ryankemper: ^^
[19:22:13] We're keen to see it fixed because wikidata users are currently unable to make bot edits
[19:23:13] tarrow: ebernhardson: thanks, deploying
[19:24:02] cheers!
[19:29:38] but hilariously, at the same time the wdqs updater started working again too? :P
[19:29:54] Did you have a ticket for the original incident on codfw, just so I can leave a nice paper trail for the dayshift?
[19:32:36] tarrow: I don't think there's a ticket for the original incident yet. I'm working on a lightweight incident report that can be linked to as well
[19:33:13] ryankemper: should we cancel our pairing session since Brian is out and you're busy on the incident report?
[19:33:30] gehel: yeah, let's do that
[19:33:35] ack
[19:33:42] ping me if you need me
[19:33:52] coolio! no problem; thanks for all the help; have a nice rest of the evening/day :)
[19:34:13] tarrow: good night
[19:34:18] rolled back to a previous version on a previous checkpoint and it worked...
[19:34:28] dcausse: \o/
[19:35:04] still need to understand why this failed...
[19:35:19] going to deploy the fix for wcqs
[19:37:35] ryankemper: oh cool, you triggered a build on the rdf repo
[19:38:04] dcausse: yup, should be finished in ~10 mins
[19:38:22] dcausse: so I gather the failures are related to event time getting set to null? or is that not the entire problem
[19:38:24] kk, that should fix the updater issue on the wcqs@codfw updaters
[19:38:42] ryankemper: it's related but not the main issue
[19:38:50] ack
[19:39:14] main issue is flink being unable to restart from a savepoint when I deployed a new version of the job
[19:40:08] tried multiple things without much luck and finally rolled back to a previous version of the app on an old "checkpoint" (not savepoint)
[19:40:16] the cause is still unclear
[19:40:45] the 2 savepoints I took before the upgrade did not work
[19:42:27] interesting
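For future reference, a rough sketch of what that rollback amounts to with the stock Flink CLI. The job id, checkpoint URI and jar name are placeholders, and the production streaming updater is deployed through its own tooling rather than by hand, so treat this purely as an illustration.

```bash
# Cancel the job that keeps failing to restore (flink stop would also take a
# final savepoint, but here the savepoints are exactly what isn't restoring).
flink cancel <job-id>

# Resubmit the previous version of the job from a retained externalized
# checkpoint: -s/--fromSavepoint accepts checkpoint directories as well as
# savepoints. CHECKPOINT_BASE_URI stands for wherever the state backend
# writes (object store / HDFS). If a *new* job graph had dropped some state,
# -n/--allowNonRestoredState would be needed, but it silently discards that
# state, so only use it deliberately.
flink run -d \
  -s "${CHECKPOINT_BASE_URI}/<job-id>/chk-12345" \
  streaming-updater-producer-0.3.103.jar
```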
[20:04:10] * ryankemper facepalms
[20:04:21] dcausse: that's exactly it, thanks
[20:04:42] there are some repos that do this automatically, it's less error-prone imo
[20:08:40] Could pretty trivially wrap the logic in a helper script so we're just running a "deploy_latest_qs w[d,c]qs" type command
[20:09:30] dcausse: okay, I see `0.3.104` running now
[20:09:43] thanks!
[20:10:59] fix seems ok according to wcqs2003
[20:12:20] dcausse: were we seeing errors in the logs previously that resolved? or what are you looking at to confirm
[20:12:39] both grafana and the logs
[20:12:57] https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1&var-site=codfw&var-k8sds=codfw%20prometheus%2Fk8s&var-opsds=codfw%20prometheus%2Fops&var-cluster_name=wcqs&from=now-1h&to=now
[20:13:12] wcqs2003 resumed consumption
[20:13:30] will force-start the updater on the other nodes
[20:20:51] inflatador: (for tomorrow) what's the status on T297907? Is everything done (especially VictorOps)?
[20:20:51] T297907: SRE Onboarding - Brian King, Search Platform team - https://phabricator.wikimedia.org/T297907
[20:23:52] ebernhardson: isn't T295734 done? I remember you saying that we have 2 instances of Cindy running...
[20:23:55] T295734: Bring up two copies of the CirrusSearch browser integration env in cloud - https://phabricator.wikimedia.org/T295734
[20:24:27] gehel: the distinction was that the goal was a 7.10 instance, but we didn't have 7.10 plugins ready, so I set up the current and a 6.8. I guess I was leaving it around to upgrade the 6.5 instance to 7.10 after we switch
[20:24:43] ebernhardson: makes sense, thanks
[20:24:58] looks like we're out of the woods, going to eat something, will check graphs in a while... if we can afford keeping wdqs@codfw depooled that'd be great, so that I can investigate deeper tomorrow
[20:26:07] dcausse: kk, thanks for getting it going again
[20:26:17] dcausse: thanks! Get some rest!
[20:28:41] ^ echoing the above, thanks David
[20:28:55] Keeping wdqs@codfw depooled sounds reasonable to me
[23:41:28] hmm, first stab at seeing if geodata was slow finds that the p99 latency elastic<->cirrus is ~1s. But the total time spent in php only comes in at 400ms. Something about these two ways of calculating percentiles doesn't agree :S
[23:41:51] Err, I mean the p99 for the length of a php request that includes a geodata request comes in at ~400ms
[23:43:34] I do wonder a bit about those tail latencies, but it's unclear what we can do. The p95 for elastic<->cirrus is reported as 80ms, which is plenty good
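A hedged, made-up-numbers illustration of how both p99 figures above can be right at once: the two percentiles are computed over different populations (individual elastic calls vs. whole php requests), so neither has to bound the other.

$$
\begin{aligned}
&1000 \text{ PHP requests} \times 5 \text{ elastic calls each} = 5000 \text{ backend samples}\\
&\text{backend } p_{99} \Rightarrow \text{the 50 slowest calls, here} \approx 1\,\mathrm{s}\\
&\text{if those 50 calls cluster in, say, 8 requests} \Rightarrow 8/1000 = 0.8\% \text{ of requests are slow}\\
&\text{request-level } p_{99} = \text{the 10th-slowest request, which can still be} \approx 400\,\mathrm{ms}
\end{aligned}
$$

If the slow elastic calls concentrate in fewer than 1% of requests (retries, large shard fan-out, one slow request issuing several slow calls), the request-level p99 never sees them, so comparing the two numbers directly says little until both are aggregated per request.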