[07:16:06] scap's image build seems to be taking a while today. not sure if that's the usual scap slowness or something related to the deployment server switchover yesterday
[07:20:16] now finished after 10 minutes
[07:39:06] may I direct your attention to ongoing criticals? I am not particularly worried about any, but with so many things in red, it is later more difficult to notice new anomalies, which is important in this week of changes. For example, I am going to ack and notify cloud of ongoing cloudservices2004-dev backup failures.
[08:18:19] <_joe_> taavi: i guess it needed to do a full image rebuild
[08:18:44] <_joe_> taavi: did it happen on every deployment or just one?
[08:19:07] _joe_: I only did one deployment this morning so not sure, sorry
[09:05:53] <_joe_> gehel: sorry for pinging you but I don't know who else to ping at this hour; there is an alert about MediawikiPageContentChangeEnrichJobManagerNotRunning in eqiad
[09:06:34] _joe_: thanks for the ping! I'll find someone to look!
[09:06:47] <_joe_> it looks like it stopped at 15:40 UTC yesterday
[09:07:18] <_joe_> and moved to codfw, apparently
[09:09:35] <_joe_> and there's nothing in SAL
[09:11:30] <_joe_> which should be avoided, if it was a human operation.
[09:12:29] Jennifer and Joseph are on opsweek in Data Engineering this week. Gabriele knows this part best and is also having a look.
[09:20:38] MediawikiPageContentChangeEnrichJobManagerNotRunning seems related to the DC switch. Codfw is processing the updates (as it should - no updates are happening in eqiad at this time).
[09:21:07] No emergency on this, since we don't expect work to happen in eqiad at the moment for that job. We still need to understand why it is not running.
[09:21:20] Gabriele is on it and will post an update
[09:23:41] <_joe_> gehel: that's wrong, the updates are happening in eqiad
[09:23:55] <_joe_> we haven't switched the jobrunners or the editing path
[09:24:11] <_joe_> ah right but eventgate is in codfw
[09:24:38] <_joe_> gehel: why is nothing running in eqiad, while we had 1 thread idling in codfw?
[09:25:14] I only have limited understanding of that job. But it is a Flink job, consuming the Kafka streams behind eventgate directly.
[09:25:51] <_joe_> yeah my point is it shouldn't "die"
[09:26:12] Yeah, dying is not the expected behaviour!
[09:26:59] <_joe_> can I bash this?
[09:27:54] Do we have an easy way to see the state of the current DC switch?
[09:28:09] Sure! Bash all you want ;)
[09:28:59] gehel: what do you want to know?
[09:29:38] I'm balanced between adding "dying is the only certainty we have" and "I'm immortal. Proof: I've never died".
[09:29:47] gehel: status of service catalogue services with discovery can be gotten by going to cumin and running sudo cookbook -d sre.discovery.datacenter status all
[09:29:51] if you have access
[09:30:46] <_joe_> gehel: the easiest for you is
[09:30:53] claime: that cookbook is a good start!
[09:31:03] <_joe_> ah claime beat me to it
[09:31:21] I'm very fast, except when I'm very slow
[09:31:23] :p
[09:31:57] what I'm wondering here is: where are edits happening, where is jobrunner running
[09:32:27] eqiad is still the active mediawiki datacenter
[09:32:49] <_joe_> gehel: both are still in eqiad
[09:32:58] <_joe_> but mediawiki emits events to eventgate
[09:33:03] <_joe_> which is now codfw-only
[09:33:11] <_joe_> so events are produced to the codfw kafkas
[09:33:35] * gehel dreams of a dynamically generated map of all dataflows :)
[09:33:51] <_joe_> gehel: you mean tracing?
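For anyone following along, a minimal sketch of the checks discussed above. The cookbook invocation is the one claime quoted; the dig example assumes the service in question exposes a discovery DNS record (eventgate-main.discovery.wmnet is illustrative, not confirmed from the log):

$ # on a cumin host: which DC each discovery-enabled service is currently active in
$ sudo cookbook -d sre.discovery.datacenter status all
$ # resolve one service's discovery record; the returned VIP can then be matched against the per-DC service IPs
$ dig +short eventgate-main.discovery.wmnet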
[09:33:54] <_joe_> we're working on it
[09:33:59] cool!
[09:34:18] something like it, probably
[09:34:25] * gehel is going back to meetings...
[13:16:41] Reminder that we'll start locking things down in about 15 minutes for the switchover
[13:21:14] claime: Good luck! 🤞
[13:21:35] I'm just watching this time, well wishes go to kamila_ :D
[13:22:24] Well wishes kamila_!!
[13:22:47] thanks '^^
[13:22:53] (no pressure, right? :D)
[13:23:22] Good luck
[13:23:50] https://www.youtube.com/watch?v=a01QQZyl-_I
[13:24:39] bblack: I love that song, Bowie and Mercury in the same song is a dream come true.
[13:24:47] :)
[13:32:17] good luck, it'll be ok :)
[13:36:25] bblack: I've been looping it since yesterday :D
[13:37:47] anyways, starting soonish, tmux `switchover` on kamila@cumin1001 if you want to follow (please attach readonly)
[13:39:30] 🤞
[13:39:43] gl!
[13:40:07] everyone, please make sure not to perform any host-related or cookbook-related actions. Whatever it is, it can wait for 1h or so.
[13:41:53] <_joe_> https://j.gifs.com/v2MvJX.gif
[13:42:48] thanks _joe_, that really helps :D
[13:48:38] rotfl
[13:49:50] okay, off we go!
[13:49:57] \o/
[13:50:39] gl!
[13:51:04] really hoping to not need luck today :D
[13:51:09] but thanks :D
[13:52:10] * volans attached and available if anything is needed
[13:53:10] thanks volans <3
[13:56:15] <_joe_> we should add a link to the spotify playlist "elevator music" for these 5 minutes of wait
[13:56:30] kamila_: You can lock scap now
[13:56:31] :D
[13:57:05] thanks claime :-)
[13:57:45] <_joe_> kamila_: I would re-try it
[13:57:45] elevator music is http://listen.hatnote.com/#uk,fr,sv,he,as,pa,ml,or,pl,sr,fi,eo,pt,no,bg,mk,sa,mr,te,hi,id,ar,nl,ja,de,ru,es,it,fa,zh,bn,ta,kn,gu,be,el,et,hu,en - no?
[13:57:46] run it again
[13:57:47] _joe_: or just play http://listen.hatnote.com/ (wikipedia edits as sound)?
[13:57:59] jelto: It's in the switchover doc as a monitoring tool
[13:58:07] ah great
[13:58:13] <_joe_> ok let me check on the server
[13:58:23] thank you <3
[13:58:56] <_joe_> you should be all right
[13:59:10] <_joe_> we can check why it's reporting as failing later
[13:59:10] wanna bet it's the failed jobs that are making the cookbook sad?
[13:59:16] <_joe_> claime: possibly
[13:59:48] <_joe_> systemctl list-timers | grep mediawiki returns nothing
[13:59:50] <_joe_> so we're ok
[14:00:09] are we coordinating here or on -operations?
[14:00:11] ok
[14:00:12] here
[14:00:13] <_joe_> go for readonly when you want
[14:00:13] marostegui: here
[14:00:16] k
[14:00:17] <_joe_> marostegui: here
[14:00:58] <_joe_> chat is silent, proceed
[14:01:13] that silence *shivers*
[14:01:20] <_joe_> oh come on
[14:01:22] :D
[14:01:28] lol @stashbot
[14:02:09] I am checking eswiki ro/rw (s7)
[14:02:28] <_joe_> it's a go for me
[14:02:37] 👍
[14:02:42] same here
[14:02:42] And it's back
[14:02:44] I hear sounds
[14:02:45] Good job
[14:02:47] <_joe_> and it's baack
[14:02:47] whee we have pretty sounds
[14:02:51] looks good
[14:02:54] 😌
[14:02:56] <_joe_> let's look at latencies on the dashboards
[14:03:08] are they bad?
[14:03:38] masters looking good
[14:03:39] <_joe_> we had a peak in 5xx, recovering it seems
[14:03:47] ^seeing the same
[14:03:50] <_joe_> yeah things seem stable
[14:03:53] read only time was 2:21.68 (14:00:32.114116 - 14:02:53.790615)
[14:04:05] app server workers look fine so far?
[14:04:08] <_joe_> volans: expected as we have the k8s clusters too
[14:04:11] the 5xx was all POST during the RO time
[14:04:15] <_joe_> cdanis: yes
[14:04:16] so that seems quite reasonable
[14:04:17] volans: thanks, I was search for that stat
[14:04:31] akosiaris: as a proper manager :-P
[14:04:34] searching*
[14:04:48] volans: 🙊
[14:05:29] mw-on-k8s codfw latencies are bouncing up to the level they were on eqiad before the switch, looks ok
[14:05:43] <_joe_> claime: yeah I was about to say, nothing worrisome
[14:05:45] read only errors gone
[14:05:48] \o/
[14:05:52] <_joe_> it's just that edits are more expensive
[14:05:58] a few "ApiUsageException: Search is currently too busy. Please try again later."
[14:06:00] <_joe_> kamila_: now you can breathe
[14:06:03] hey look, the world didn't end :D
[14:06:25] but so low that could be "normal"
[14:06:40] 33 in the last hour
[14:07:01] I personally would have probably waited a bit for the TTL restore, just in case, but all seems good so far :)
[14:07:16] <_joe_> volans: I disagree
[14:07:30] <_joe_> in an emergency we can wipe the caches
[14:07:43] I am looking at the errors in case there is something to learn about, not because I am worried
[14:07:46] <_joe_> and maybe we should change the cookbook to do just that
[14:07:58] <_joe_> jynus: then this isn't the correct channel where to report them though :)
[14:08:00] email flowing
[14:08:03] sorry
[14:08:20] all looking good
[14:08:33] recent changes flowing as well
[14:09:01] claime: thanks for checking!
[14:09:06] Great job kamila_, congratulations
[14:09:10] From my side I don't see anything weird
[14:09:29] <_joe_> indeed well done!
[14:09:37] 👏
[14:09:39] congrats! :D
[14:09:43] nice!
[14:09:46] <_joe_> kamila_: you can also run puppet on the db masters I think
[14:09:47] \o/
[14:09:49] <_joe_> marostegui: can we?
[14:09:51] 👏
[14:09:55] yep
[14:10:01] Awesome, well done kamila_
[14:10:11] <_joe_> SMOOTH OPERATOR!
[14:10:26] nice work y'all :-)
[14:10:27] :D
[14:10:27] pc names still to switch, right?
[14:10:37] claime: yep
[14:10:38] I am on it
[14:10:41] ack
[14:10:45] multimedia search seemed to have some temporary higher latency, now gone
[14:10:57] I have hereby proven that I am as skilled as a trained monkey :D
[14:11:01] <_joe_> arnaudb: you might be the next vic^H^H^H candidate to run it :)
[14:11:11] Am seeing some 500s on codfw appserver
[14:11:13] * arnaudb starts sweating
[14:11:30] <_joe_> claime: check logstash
[14:11:34] yep on it
[14:11:51] <_joe_> but the amount is very low, 0.5 rps vs 5k rps
[14:12:08] It's higher than it was in eqiad before, that's why I want to check
[14:12:11] it's still higher than it used to be
[14:12:14] ^that
[14:12:28] <_joe_> sure, I'm saying it's not super worrisome in itself
[14:12:47] ack, thanks
[14:12:50] <_joe_> those might be related to
[14:12:51] I don't think it is mw - I don't see the same errors there as on the edge, so it is something else
[14:12:52] <_joe_> PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is CRITICAL
[14:13:00] ^that is what I saw
[14:13:12] <_joe_> it's very possible that's the source of errors
[14:13:21] <_joe_> remember, we moved periodic jobs as well
[14:13:44] <_joe_> which means cirrus is receiving more traffic locally (not for its own updates though)
[14:14:04] kamila_: parsercache cnames merged
[14:14:06] <_joe_> but yeah that rate of 500s is annoying
[14:14:18] I can't find a cause
[14:14:25] cxserver has 80% availability
[14:15:10] maybe it is normal, idk: https://grafana.wikimedia.org/goto/RBYKSrmIz?orgId=1
[14:15:24] <_joe_> claime: from logstash I'd say the error is
[14:15:27] <_joe_> [{reqId}] {exception_url} ApiUsageException: Search is currently too busy. Please try again later.
[14:15:53] <_joe_> https://logstash.wikimedia.org/goto/dc4170014ff1d38d739c0fdec14ccaba
[14:15:58] I wasn't sure if that'd cause a 500
[14:16:24] <_joe_> we protect search with poolcounter
[14:16:28] I got it too
[14:16:32] <_joe_> so we have a maximum concurrency of requests
[14:16:33] searching on enwiki
[14:16:43] https://en.wikipedia.org/w/index.php?fulltext=1&search=fire&title=Special%3ASearch&ns0=1
[14:16:43] <_joe_> inflatador, ryankemper, gehel ^^
[14:16:58] Yep, on it!
[14:17:01] the direct match of the first article works
[14:17:02] jynus: it's back at 94% and rising.
[14:17:08] akosiaris: yeap, saw it
[14:17:12] interesting though
[14:17:15] if I hit the "search pages with ..."
[14:17:18] then I get the error
[14:17:23] <_joe_> tge errirs are giubg diwb'
[14:17:27] akosiaris: those are the learning I wanted to flag
[14:17:29] <_joe_> oh sigh off by one
[14:17:30] *learnings
[14:17:33] <_joe_> the errors are going down
[14:17:41] an entire sentence off by one
[14:17:42] lol
[14:17:50] <_joe_> akosiaris: one hand only
[14:17:55] <_joe_> it's even better :D
[14:18:12] and now I got the results for the same page (consistent with recovering)
[14:18:15] <_joe_> most errors I see are on commons
[14:18:52] edit rate looks good
[14:18:56] (general)
[14:19:31] kamila_: you can go ahead with mwmaint switch etc. while we debug
[14:19:54] ok, thanks
[14:21:12] when do we update the status page?
[14:21:52] If the 5xx on search are resolving, we can update it now I think
[14:22:53] <_joe_> volans: just happened :)
[14:23:27] thumb generation looks a bit slow - but I don't have concrete data to back that up except loading Commons' Special:NewFiles
[14:24:23] <_joe_> I'd take a grafana dashboard, thanks
[14:24:41] he he
[14:25:23] looks like the 500s are now almost gone?
[14:25:32] jhathaway: merging your puppet patch, typo fix
[14:25:40] arturo: thanks
[14:26:53] <_joe_> akosiaris: yeah, mostly
[14:28:55] Re: graph- the pods showed a spike in activity but no saturation that I can see
[14:30:23] <_joe_> yes errors are back to baseline AFAICT
[14:37:28] congrats everyone, another smooth one :)
[14:43:13] indeed :-)
[14:47:48] kamila_: I think mark was mostly thinking about you to congratulate (https://wikimedia.slack.com/archives/C05FWANFT8X/p1695220607695639), but of course I am aware of the great work the serviceops team did, and many others in other teams!
[14:48:38] mark?
[14:48:43] I sent that one?
[14:49:02] yes, I meant that mark was congratulating her, as you did there
[14:49:08] :-D
[14:49:14] ah, sorry, I misunderstood then
[14:50:28] sorry if I worded it weird
[14:51:06] thanks ^_^
[14:51:30] but the congratulations really should go to the people who made this sufficiently non-scary to be doable by a relative n00b :D
[14:52:22] I have a bit of a problem with scap, which only started when I switched to deploy2002 yesterday. I see from the motd on deploy1002 that it /should/ work if I deploy from deploy1002, but I thought I'd check here before doing so.
[14:53:10] I'm trying to deploy superset to superset-next as per here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Superset/Administration#Deploy_to_staging but scap on deploy2002 is deploying the wrong version.
[14:56:33] <_joe_> btullis: uh wait
[14:56:43] <_joe_> where did you update the git repo?
[14:56:43] Will do.
[14:56:56] https://gerrit.wikimedia.org/r/c/analytics/superset/deploy/+/957938
[14:57:10] <_joe_> no i mean, you should pull it to the server
[14:57:38] https://www.irccloud.com/pastebin/RWjlIC4j/
[14:58:16] <_joe_> btullis: and it's still deploying the wrong version?
[14:58:30] <_joe_> because that didn't happen to akosiaris afaik
[14:58:47] <_joe_> he used scap to deploy restbase
[14:59:26] Yup, multiple times now. Didn't happen all week on deploy1002 but has happened all the time from deploy2002 since yesterday.
[14:59:59] <_joe_> uhhh I have no idea what's going on then
[15:00:01] I even forced a revision:
[15:00:04] btullis@deploy2002:/srv/deployment/analytics/superset/deploy$ scap deploy --verbose -r 21b76519ee204c564d4e0c446b8f1df0ff592e74 --no-log-message -f -l an-tool1005.eqiad.wmnet "Testing upgrade to version 2.0.1"
[15:00:16] <_joe_> jnuche: ^^ can you help?
[15:01:22] I was thinking of trying a deploy from deploy1002 just for diagnostic purposes and capturing the verbose logs.
[15:01:44] <_joe_> btullis: do me a favour: try to update the git repo on 1002, then deploy from 2002
[15:01:57] Will do.
[15:02:01] <_joe_> and see if the version is the correct one
[15:03:41] <_joe_> I think the problem is the remotes set in the scap targets
[15:05:19] _joe_: I think you're onto something. Not conclusive yet, but will get back to you.
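A rough way to narrow this down might be to compare what the deployment server has checked out against what the target's scap cache actually points at. The deploy-cache path and the deploy-service user on the target are assumptions here, modelled on the usual scap target layout rather than confirmed for the superset hosts:

$ # on deploy2002: the revision scap should be shipping
$ git -C /srv/deployment/analytics/superset/deploy rev-parse HEAD
$ # on the target (an-tool1005): what the cached checkout and its remote actually say
$ sudo -u deploy-service git -C /srv/deployment/analytics/superset/deploy-cache/cache rev-parse HEAD
$ sudo -u deploy-service git -C /srv/deployment/analytics/superset/deploy-cache/cache remote -v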
[15:07:27] <_joe_> yeah kamila_ we're in for some creative bashing tomorrow
[15:07:38] yay
[15:07:46] here I was worrying it would be boring
[15:07:58] <_joe_> basically we need to change the remote in every scap::target host
[15:08:23] <_joe_> but it's a "ping me tomorrow morning" problem
[15:08:23] oh, right :D
[15:08:50] deploy2002:/srv/deployment/restbase/deploy$ git remote -v
[15:08:50] origin https://gerrit.wikimedia.org/r/p/mediawiki/services/restbase/deploy.git
[15:08:51] btw
[15:08:59] so I wasn't bitten by this for RESTBase
[15:09:17] <_joe_> akosiaris: go check the remote on a restbase host
[15:09:49] IIRC we hit this issue in the past and AFAIK it was solved
[15:10:00] (the change of remote)
[15:10:06] am I misremembering?
[15:10:12] volans: WDYM by "solved"?
[15:10:46] restbase1021:/srv/deployment/restbase/deploy-cache/cache$ sudo -u deploy-service git remote -v
[15:10:46] origin http://deploy2002.codfw.wmnet/restbase/deploy/.git
[15:10:52] so, it's the correct one
[15:10:55] volans: there is no step taking care of that in the mwmaint switch instructions, so I'm not aware of any automation taking care of it
[15:11:07] kamila_: in the scap software
[15:11:11] ah
[15:11:24] I'm trying to find some reference though
[15:11:29] <_joe_> volans: I don't think it was
[15:11:30] I might have dreamed about it :D
[15:11:41] <_joe_> but basically, url = http://deploy1002.eqiad.wmnet/restbase/deploy/.git/modules/restbase
[15:11:47] <_joe_> this is the issue
[15:12:00] The analytics/superset/deploy repo might have just got left behind. Thanks all.
[15:12:02] yes and that rings a bell
[15:15:08] found it T197470
[15:15:09] T197470: find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470
[15:15:31] so, stupid question: why isn't it using the deployment.eqiad.wmnet alias?
[15:15:40] https://phabricator.wikimedia.org/T197470#9097056 last message
[15:18:07] <_joe_> kamila_: ssh host key I guess
[15:18:13] <_joe_> but I was thinking about it
[15:18:15] ah yes that would be a thing
[15:18:31] I said it was a stupid question :D
[15:18:56] <_joe_> it might be easier to automate as a fix
[15:19:09] also that alias should really be changed to something that doesn't have the DC in its name :D
[15:19:28] but that's me :)
[15:19:36] <_joe_> volans: indeed!
[15:19:41] that too :D
[15:19:54] <_joe_> (to his latter message, who cares of the name, it's a label)
[15:19:59] <_joe_> :D
[15:25:28] btullis, _joe_: sry, just saw the ping
[15:25:42] looks like you found the issue
[15:26:39] Yes thanks. Have a workaround for now and a full fix possible.
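For reference, a hand-run sketch of the per-repo fix _joe_ describes, using the restbase paths quoted above. This is untested, the hostnames would need to match the current deployment server, and the real fix would presumably be automated across all scap::target hosts rather than typed by hand:

$ # on a scap target: make sure the cached checkout points at the active deployment server
$ sudo -u deploy-service git -C /srv/deployment/restbase/deploy-cache/cache remote set-url origin http://deploy2002.codfw.wmnet/restbase/deploy/.git
$ # submodule remotes (the .git/modules/... URL _joe_ flagged) need the same treatment
$ sudo -u deploy-service git -C /srv/deployment/restbase/deploy-cache/cache submodule foreach --recursive \
    'git remote set-url origin "$(git remote get-url origin | sed s#deploy1002.eqiad.wmnet#deploy2002.codfw.wmnet#)"'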
[16:10:04] I see a lot of writes on s1 https://grafana-rw.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&viewPanel=37 and increased latencies and request rates for parsoid and appservers
[16:30:49] started at 15:16
[16:41:38] s1 writes seem to have stopped now: https://grafana.wikimedia.org/goto/3e6sTriIz?orgId=1
[16:41:54] (increased one, I mean)
[16:41:58] *ones
[16:42:26] but I can try to check what it was through the binlog
[16:46:59] I think it is templatelinks (template update)
[16:47:54] if you look at the s1 master graph, the increase is less dramatic: https://grafana.wikimedia.org/goto/ZpYi09mIz?orgId=1 (as replication makes the increase 16x worse)
[16:52:40] it could also be: MediaWiki\Deferred\LinksUpdate\LinksUpdate::updateLinksTimestamp
[16:53:00] I've seen more pages purged than usual at that time, unsure why
[16:53:36] 25842 out of 47723 updates were that function around those dates
[16:54:10] and updates firt the shape of the load: https://grafana.wikimedia.org/goto/xE8hAriSk?orgId=1
[16:54:12] *fit
[18:28:50] slyngs, moritzm, I just tried to create an account at https://idm.wikimedia.org/signup (using root@wmcloud.org as the email address) and I never got the confirmation email. And I don't see the account in ldap, which I assume is no coincidence.
[18:28:55] Any suggestions about how to move forward?
[18:45:20] andrewbogott: Yes... First up, it's not working because we require the email to be unique. The solution is to use a unique email, preferably the email for the team responsible for the account, so +@wikimedia.org. I assume that it's a service you need an account for, if not we need to come up with another solution
[18:46:16] slyngs: I made a different account with a different email and it worked.
[18:46:24] Does/could the sign-up page explain that about the email address?
[18:46:41] Does, NO, could... Absolutely
[18:47:47] If there is a use case which we've overlooked please do let us know and I'll find a solution
[18:50:52] mostly the silent failure seems like a problem
[18:59:47] andrewbogott: I see a "550 Address root@wmcloud.org does not exist" in my spam folder
[19:00:04] which also includes the confirmation link you're looking for
[19:00:33] well that raises new questions :)
[19:01:02] (mostly why my google search can't find that message)
[19:01:17] setting up mail for wmcloud.org is https://phabricator.wikimedia.org/T278109
[19:01:24] do you want me to forward that to you?
[19:01:37] sure, thanks
[19:03:04] you probably shouldn't use that confirmation link, though, right? since the purpose is to make sure every account has a working email address, and that one doesn't work :)
[19:03:07] just to be explicit
[21:03:55] _joe_: looks like mw-on-k8s is less stable reaching etcd than baremetal https://phabricator.wikimedia.org/T346971#9185168
[22:15:07] <_joe_> Krinkle: I replied with a few options
[22:16:23] <_joe_> basically, 1) network not perfectly set up on container startup 2) more apc evictions exposing a bug 3) dns resolution failures causing this kind of exception without error
[22:16:23] Thx
[22:16:40] <_joe_> the code does its best to hide what's wrong from us :P
[22:16:44] I'll see if we can add detail to determine 3
[22:17:33] <_joe_> but it's definitely strange this happens with no etcd config available
[22:18:00] <_joe_> and in pods with little apc fragmentation and no evictions I can see
[22:18:12] <_joe_> so why would the config not be available, even stale
[23:35:50] if anyone is available to review and assist with merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/959359
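A small sketch of how one might start checking option 3 above (DNS resolution failing inside the pods). The namespace, pod name, container name, and config host are placeholders rather than real values, and this assumes kubectl access to the cluster plus standard tools in the image:

$ # resolve the config/etcd host from inside a mediawiki pod
$ kubectl -n mw-web exec mw-web-pod-abc123 -c mediawiki -- getent hosts conf1234.example.wmnet
$ # same check through PHP's resolver, closer to what MediaWiki itself does
$ kubectl -n mw-web exec mw-web-pod-abc123 -c mediawiki -- php -r 'var_dump(dns_get_record("conf1234.example.wmnet", DNS_A));'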