[01:11:00] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 6.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321 [01:12:50] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321 [07:31:18] I am going to start working on https://phabricator.wikimedia.org/T330619 [07:31:41] I have a pile of emails to read but this has more priority. If there's something I need to focus on before that, please let me know [09:18:44] I think ms-fe2013 is a WIP setup host, can I ack its backend timeout alert? [09:20:26] jynus: yes, per T326848 there seems to be some problem with the firmware that's waiting for input from jbond [09:20:27] T326848: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 [09:20:54] (thanks!) [09:21:08] yeah, that's super ok, I am only trying to make accounted issues not spam ongoing alerts [09:21:36] I am acking with an expiration date of one week [09:44:24] marostegui: Thank you <3 [09:44:47] :) [09:46:00] Also, I didn´t know pc was supposed to switch master DNS aliases, does it have an actual production impact, or is it just for consistency's sake? [09:46:11] nah [09:46:17] none of those things are used in production [09:46:28] it is mostly for us [09:46:37] a'ight [10:42:47] I have finished https://phabricator.wikimedia.org/T330619 but I would appreciate a second check making sure everything that is supposed to have codfw -> eqiad replication is done [10:42:54] I have done: sX, x1, pcX, es4 and es5 [10:43:02] Anything I could have forgotten? [10:43:15] (x2) already has it [10:58:08] I have more of a question... I know x2 was a source of issues because what, at some point, dbas were told and what mw actually implemented was different [10:58:47] is there a place where x2 and pc expectations are documented, specially related to replication? [11:00:24] I remember there was some discussion if to leave some of those independent or replicate between sites, but I don't remember where [11:01:14] jynus: I don't know :( [11:01:16] (so not really a comment and more of a documentation- which I can do if you point me to the discussion) [11:01:32] but you know what I am talking about, right? [11:01:34] so x2 will always use circular replication as they are active-active [11:01:39] but parsercache isn't [11:01:54] so pc is dc-local? [11:01:59] jynus: https://wikitech.wikimedia.org/wiki/MariaDB#x2 [11:02:16] yes, pc is like a normal sX in terms of writes and replication [11:02:32] We are ready to make it like x2, but for now it's not been requested [11:02:35] ah, you mean that it switches direction from the primary? [11:02:37] I see [11:02:48] so I would like to add that to the doc for my own sake [11:02:51] yeah, for now you can think of pc = s [11:03:22] although double check if that coincides with the switchover process on the cookbook [11:03:59] basically to point somewhere if someone says otherwise "loook what we have now as documented" and avoid missunderstandings [11:04:44] (I belive pc isn't checked for codfw->eqiad replication) And that is probably ok [11:05:02] yeah [11:05:34] do you remember the ticket or some keywords for that discussion (I think it was discarded because the write dc would have a warmer pc) [11:06:05] probably here: T277831 [11:06:08] T277831: Evaluate WMF's ParserCache database setup for Multi DC - https://phabricator.wikimedia.org/T277831 [11:06:44] this is nice: T277831#8166216 [11:06:57] I will adapt this to the wiki so it is clearer there [11:07:04] with your permission [11:07:23] sure! [11:07:55] basically, trying to avoid missunderstandings, nothing actionable here except what I want to clarify on the doc [11:08:38] yep [11:09:16] after that I can do a double check on replication if you want [11:09:46] Not replication (as that is working fine) but mostly that I haven't left anything behind that is supposed to have codfw -> eqiad [11:10:04] yes, that is what I meant, sorry :-D [11:10:21] topology would be the better word 0:-D [11:10:25] yep thanks [11:11:21] So I think this is what confused me, there is a proposal to make pc like x2 (T277831#8168447) but not done ATM [11:11:21] T277831: Evaluate WMF's ParserCache database setup for Multi DC - https://phabricator.wikimedia.org/T277831 [11:12:21] yeah, it is not yet done [11:12:24] but we are ready for it anytime [11:17:13] Emperor: is it ok to play with ms-fe2013 [11:19:11] jbond: yes, please do - it's not yet in service [11:19:42] Emperor: great thanks [11:19:57] ...so if you can help papaul get it into service, I'll be happy :) [11:21:30] ack i think its all ready for a reimage but just want to check why it errored [11:22:08] Cool, sounds good [11:23:38] jbond: do you mind downtiming in icinga, pls? [11:27:15] Emperor: sure thing [11:28:43] thanks [11:57:05] Emperor: yu should be all goog now. i have double checked and the firmware is all upgraded correctly [11:59:35] Thanks <3 [12:26:59] I know there is no maintenance allowed for the DBs, but I am going to work on this https://phabricator.wikimedia.org/T330653 [12:28:53] marostegui: ok for me to test running b2b sre.switchdc.mediawiki.03-set-db-readonly and sre.switchdc.mediawiki.06-set-db-readwrite or would you rather I do that during one of tomorrow's windows? [12:29:14] claime: if you can give me 1h [12:29:29] No problem, I can go run an errand and will do that when I come back [12:30:02] sounds good, thank you [13:23:35] Hello all, we are planning to move the maintain-dbusers.py script from labstore1004 host to cloudcontrol1005 host. Anyone know how to give the cloudcontrol1005 host the same database grants that labstore1004 has? [13:29:46] Raymond_Ndibe: a useful tool for grant migrations would be pt-show-grants- that lists the grants and can be grep'ed and transformed before reapplied [13:32:32] but I am not familiar with how that works, somtimes it is just a question of updating/adding the right host ips, other times it is more involved (e.g. changing procedures' owner), it depends [13:33:53] example usage: https://phabricator.wikimedia.org/P17328 [13:35:30] Hello :jynus, thanks or dropping these useful pointers. Let me quickly look that these [13:37:18] in the easiest case possible it is "pt-show-grants | grep old_user | sed 's/old_user/new_user/' | mysql" (*not actual code*), but may need more work, depending on the needs [13:37:52] or old_ip -> new_ip, you get the idea :-) [13:37:56] jynus: so we have this labsdbaccounts user+db on m5-master.eqiad.wmnet, and we have to move the script from one host to the other (thus, changing the ip), I guess the question is, should be open a task for it? Do it ourselves? Manually or there's some script somethewere that syncs those? [13:38:25] (any answer is ok xd, just not sure about the process and don't want to mess anything up!) [13:38:48] dcaro: if the IP that will connect to m5-master is going to change, yes, open a task so we can double check [13:38:56] dcaro: let us know the old and new ip please [13:39:30] marostegui: okok, will do then, thanks! (btw. the pt-show-grants tip will be useful for toolsdb xd, so thanks for that too) [13:39:31] dcaro: We are dealing with the DC switchover so I am not sure we'll get to it until late this week [13:39:57] dcaro: in addition to the labsdbaccounts user for maintain-dbusers metadata, you will need the grants to actually create the users on toolsdb and the replica servers too [13:41:35] Oh I thought we were not creating the users on replica servers, but yep, that will be needed. For toolsdb, we are the ones managing that server right? (so no task for data persistence needed?) [13:42:02] correct [13:54:07] marostegui: this is what I added to the docs: https://wikitech.wikimedia.org/wiki/MariaDB#Parsercaches [13:56:27] jynus: thanks - makes sense [13:59:45] I will also add they are by default in read_write all the time, which is something "different" [14:00:02] yep yah [14:04:24] I think the outage I caused a few switchovers ago was about x1, not pc, so will also make double check that is accounted in the script [14:07:02] claime: you can proceed with the tests as you wish [14:11:12] marostegui: thx [14:38:42] marostegui: Tested and successful \o/ [14:39:18] great! [15:43:18] Amir1: https://gerrit.wikimedia.org/r/892478 when you can [15:43:26] I have decided to also do pc1011 today and get this over with [17:04:02] sorry I was afk, done now [17:05:30] no worries [17:05:32] I will push it tomorrow [19:40:43] trying to figure out if this is a data integrity issue or if I'm doing something stupid... first please see https://meta.wikimedia.org/wiki/Special:CentralAuth/%D0%AE%D0%BB%D0%B8%D0%B0%D0%BD_2_%D0%9A%D0%BB%D0%B0%D0%B2%D0%B4%D0%B8%D0%B9 [19:41:00] then we have this query (ran on the Toolforge replicas): `SELECT lu_wiki, lu_attached_timestamp, gu_locked FROM centralauth_p.localuser JOIN centralauth_p.globaluser ON lu_global_id = gu_id WHERE lu_name = "Юлиан 2 Клавдий‬";` [19:41:06] Empty set (0.003 sec) [19:41:14] that doesn't seem right! [19:42:13] also just ran on prod db, empty set [19:43:02] queries for similar usernames return expected results, i.e. `Юлиан 7` or `Юлиан Махаринский` [19:44:25] musikanimal: not on a computer atm, but send me the details somewhere and I'll have a look [19:45:11] I'll create a task. I think I'm confident now there's some sort of issue [19:45:32] I get results when I query localuser or globaluser, and they have matching IDs. But joining, we get nothing [19:53:43] musikanimal: I think you have some character issue for the username. If I use your query but remove the username and instead copy-paste it from Special:CA it works [19:55:24] ah, just as I filed https://phabricator.wikimedia.org/T330696 [19:55:46] yup! you're right [19:55:47] give me a second if I can figure out the exact ussye [19:56:25] what's weird is my bot is using the username from CentralAuth [19:56:29] that's how I noticed the issue [19:58:22] `SELECT * FROM localuser JOIN globaluser ON lu_global_id = gu_id WHERE lu_name = (SELECT gu_name FROM globaluser WHERE gu_id = 68195910)` works, so yeah... no issue with the data methinks [19:58:35] I figured it was something stupid! I do wonder why my bot didn't get it right, either [22:52:38] Finally figured it out! My bot was getting these usernames from parsing signatures, and in this case the username had an invisible character (`0x202c`) after it. It's a unicode control character used for directional formatting, but the user either put it in their custom sig, or it is added when the page language is English (or not Russian). Or something like that. Interesting stuff!