[05:32:21] Hello, need a review https://gerrit.wikimedia.org/r/c/operations/puppet/+/815830/ :)
[07:40:32] Amir1: db1168 (s6) depooled for a reason?
[07:40:43] marostegui: running schema change
[07:40:52] did I step on your toes?
[07:40:56] no no
[07:41:01] I was just reviewing depooled hosts :)
[07:41:53] marostegui: I just started it, it'll take ten hours maybe and then it'll show up
[07:42:33] XDDDDD
[09:31:55] Amir1: Old s7 master is fully repooled by the way, you can depool and do anything you like with it!
[09:32:05] awesome
[09:58:46] marostegui: remember the transfer.py issue?
[10:04:08] the disconnect one?
[10:05:45] yeah, I was able to reproduce it with otrs
[10:06:03] and it turns out that the transfer is ok afterwards
[10:06:10] so it is the checking logic
[10:09:45] and the disconnects could be from the restart of the ssh daemon every day
[10:15:46] Ah, interesting
[10:15:54] And when it finishes... does it close the connection?
[10:16:28] it tries to, but if ssh failed, it won't be able to
[10:46:47] jynus: wouldn't expect restarting sshd to interrupt in-flight connections.
[11:01:24] marostegui: btw, I did everything I had to do with db2078 - you are free to destroy it or set it on fire if necessary
[11:01:38] \o/
[11:01:40] Thanks
[11:31:41] Amir1: you probably need to give more downtime to s5 codfw, it is 3d behind now
[11:31:46] maybe give it till monday?
[11:32:12] marostegui: did it this morning 😅😅
[11:32:16] haha
[11:32:20] For two days
[11:32:29] But probably more is needed
[11:32:37] Yeah, to avoid noise during the weekend
[11:32:40] We can check tomorrow
[11:32:44] At least for the indirect replica
[12:04:13] marostegui: is db2157 new? It is the only codfw replica without lag on s5, that's fishy
[12:09:42] it is new yes
[12:10:22] new but it was installed last week
[12:33:28] it has the new schema, that's good
[12:53:41] marostegui: I think your schema change is finished on s3. Let me know when I can start the templatelinks one there
[13:03:56] I will check
[13:04:03] cause I think one host didn't apply it
[13:04:07] give me a few
[13:25:58] Amir1: so only one host + the master missing, going to run it now
[13:26:14] cool
[13:30:29] Amir1: do you know if this script would reload the config? /srv/mediawiki-staging/multiversion/MWScript.php extensions/FlaggedRevs/maintenance/pruneRevData.php --wiki=plwiktionary
[13:30:45] this won't, let me restart it
[13:31:26] it should be restarted now
[13:31:51] Amir1: yep, the schema change is going on now!
[13:31:52] thanks
[13:37:12] awesome
[13:37:23] I really should work on the reload of config, this is really annoying
[13:37:56] Amir1: I am going to start altering the master today, it will probably take a few hours though, so I don't think you'll be able to start the templatelinks change there until tomorrow morning :(
[13:37:59] Is that ok?
[13:38:17] sure thing
[13:38:42] I won't be able to change config until Monday anyway (the train needs to stabilize)
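A minimal sketch of the kind of spot check behind "it has the new schema" / "I think one host didn't apply it": compare the primary key definition on each replica against the expected one. The host names are placeholders and plwiktionary is only used as an example s3 wiki; in practice the DBA schema-change tooling tracks this rather than an ad-hoc loop, and client credentials are assumed to be in place.

    # placeholder hosts; adjust wiki/table to the schema change being checked
    for host in db1111 db2222; do
        echo "== ${host} =="
        mysql -h "${host}" -e "SHOW CREATE TABLE plwiktionary.templatelinks\G" \
            | grep -A2 "PRIMARY KEY"
    done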
[15:53:49] hi all, I was just refreshing myself on https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_master_(a.k.a._promoting_a_new_replica_to_master) and I notice that it mentions db-eqiad.php and db-codfw.php, however these files are no longer in mediawiki-config. Have they been moved or does the article need updating?
[15:55:53] most of that is now on dbctl
[15:56:06] https://wikitech.wikimedia.org/wiki/Dbctl
[15:56:54] thanks jynus
[15:57:15] and in general most of that should be on a command, but it is not fully automated
[15:58:18] ideally, db-switchover current_primary new_primary should do everything, but not yet
[15:59:41] would the following be enough https://wikitech.wikimedia.org/wiki/Dbctl#Setting_a_host_as_new_master_and_also_depool_the_previous_master_(which_is_what_we_normally_do_when_we_failover_a_master)
[15:59:51] or do you still need to run db-switchover
[16:00:01] so at the moment there are 2 kinds of tasks
[16:00:06] mw ones and mysql ones
[16:00:38] mw ones are handled by dbctl
[16:00:50] mysql ones are handled mostly by db-switchover
[16:00:56] ack thanks
[16:01:08] but I think the plan is to do everything in a single command
[16:01:42] in general you shouldn't try to do that on your own - I wouldn't do it myself
[16:02:09] try to get the primary up / set the section in read only
[16:02:35] we can be in read only for an extended period of time
[16:03:28] jynus: I was just going through https://office.wikimedia.org/wiki/SRE/Training_Checklists#Goal:_Oncall_readiness
[16:04:00] "I can failover a primary server for the various types of sections." is too ambitious IMHO
[16:04:05] but it is for the dbas to decide :-)
[16:04:38] jynus: I'm glad because some of those things did seem a bit daunting to me :)
[16:05:15] could you give the following edit a double check to make sure I at least made the wiki page correct https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2Ftroubleshooting&type=revision&diff=1998649&oldid=1956328
[16:05:17] so I am supposed to be familiar with it, being formerly a dba, and I would still call m or a if I saw an issue
[16:05:22] you or one of the DBAs
[16:05:34] ack good to know
[16:06:24] I say ask the dbas tomorrow, they probably have a checklist somewhere else to correct that
[16:06:36] ack will do thanks
[16:06:42] I can send you, instead, a curveball
[16:06:57] and give you another track for data recovery instead
[16:06:59] *task
[16:07:39] for example, I think some db groups still use mw deploys
[16:10:29] I think this is a more recent checklist: https://phabricator.wikimedia.org/T313383
[16:11:20] oh nice, I'll add that to the wiki article as well
[16:11:38] * jbond has already sent a mail to a and m for review
[16:12:48] let me challenge you with 2 additional tasks you should be able to do
[16:15:55] See the 2 new data recovery steps: https://office.wikimedia.org/wiki/SRE/Training_Checklists#Goal%3A_Oncall_readiness
[16:19:08] jynus: I have never tried either of them but the wikitech articles seem reasonable
[16:19:19] yeah, that is up to date
[16:19:39] just, if you try a deletion, make sure it is in dry mode :-D
[16:20:16] ack :)
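To make jynus's split concrete, a hedged sketch of the MediaWiki side of a primary switchover with dbctl, roughly following the wikitech section linked above. Section and host names are placeholders, the exact subcommands and flags are whatever the dbctl installed on the cluster management hosts accepts, and the mysql side (db-switchover, replication hierarchy, heartbeat) still has to happen in between - nothing takes effect until `dbctl config commit`.

    # placeholder section and hosts - this is the "mw tasks" half only
    sudo dbctl --scope eqiad section s1 ro "Primary switchover in progress"
    sudo dbctl config commit -m "Set s1 eqiad read-only for switchover"
    # ... db-switchover / manual mysql steps run here ...
    sudo dbctl --scope eqiad section s1 set-master db1234
    sudo dbctl instance db1111 depool
    sudo dbctl config diff
    sudo dbctl config commit -m "Promote db1234 to s1 primary, depool db1111"
    sudo dbctl --scope eqiad section s1 rw
    sudo dbctl config commit -m "Set s1 eqiad back to read-write"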
[18:12:12] jbond: I have done a couple of switchovers recently. What we do these days is follow the checklist created by the script
[18:13:11] yeah, I think the doc needs an update - even if the update is to delete and point to the script :-)
[18:14:12] Amir1: what's the process for adding db grants for a new mediawiki appserver? taavi just pointed out that I set up the grants for striker but not for wikitech. IPs are 208.80.154.150 and 208.80.155.117
[18:14:40] jynus: I'll do that
[18:14:59] andrewbogott: you should usually do it manually and then create a patch in puppet for future cases
[18:15:13] (e.g. new hosts)
[18:15:18] Amir1: ok, so same as https://gerrit.wikimedia.org/r/c/operations/puppet/+/815378/3/modules/profile/templates/mariadb/grants/production-m5.sql.erb#220
[18:16:06] looks like we never documented the existing labswiki access grants
[18:16:29] but labswiki lives on production dbs, not on m5 anymore
[18:16:38] yup but those don't mean it gets applied automatically as those pages have significant drift from production already :)
[18:16:39] *core
[18:16:47] taavi: yeah, I was just noticing that :(
[18:17:12] jynus: correct, I'll need another patch like that one but for the proper (non-m5) db host
[18:17:15] jynus: I think that's also not documented in the core grants
[18:17:22] which would be 'core' right?
[18:17:23] my hope is to get rid of it tbh
[18:17:36] I quite dislike grants to the public network IPs
[18:18:35] something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/816026/ I think
[18:19:02] ah, I see, you want to add a new app server, I understood it wrongly - I thought a new db was to be added
[18:19:13] taavi: s6 ;)
[18:19:21] jynus: yep, just doing a hardware refresh
[18:19:36] Amir1: too used to dealing with centralauth :P fixed
[18:19:54] haha
[18:20:18] not sure if it will also need an update to the firewall
[18:20:42] firewall rules are already in place (I am seeing 'access denied' errors in the logs)
[18:20:59] I think those come from the labweb_hosts hiera key
[18:21:09] ok
[18:23:16] Amir1: have the bandwidth to manually apply that now?
[18:25:25] andrewbogott: it might take a bit
[18:25:44] Amir1: ok, just ping me if/when ready :) thank you!
[18:25:59] sure
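For reference, applying the grants "manually" amounts to something like the following on the relevant primary. The user name, password placeholder and privilege list are illustrative only - the canonical statements are whatever the puppet grants templates (as in the Gerrit change above) and the private repo record; the SELECT on heartbeat is the piece that matters for the lag reporting discussed further down.

    # illustrative user/privileges; IPs are the new wikitech appservers from the chat
    for ip in 208.80.154.150 208.80.155.117; do
        sudo mysql -e "
            GRANT USAGE ON *.* TO 'wikiuser'@'${ip}' IDENTIFIED BY '<password>';
            GRANT SELECT, INSERT, UPDATE, DELETE ON labswiki.* TO 'wikiuser'@'${ip}';
            GRANT SELECT ON heartbeat.* TO 'wikiuser'@'${ip}';"
    done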
[21:46:48] we are seeing occasional read-only errors on wikitech, due to s6 replag. I don't think this is due to any of my recent cloudweb maneuvers...
[21:46:50] https://orchestrator.wikimedia.org/web/cluster/alias/s6
[21:48:56] from sal messages, I think it's T312863
[21:48:57] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[22:18:32] andrewbogott: db1098 is lagging but it's not pooled https://noc.wikimedia.org/dbconfig/eqiad.json
[22:20:01] Amir1: ok -- any idea why the replag errors on wikitech then? https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors?_g=h@42b0d52&_a=h@c759faa
[22:21:37] I don't know exactly but if it's something with s6, then we would have had a flood of such errors from ruwiki, frwiki, jawiki
[22:22:32] it can be that apache is not getting updated with etcd changes
[22:23:03] in labweb1001
[22:23:28] so it still thinks it's pooled (and lagged, hence read-only?)
[22:23:46] I'm not sure I'm following
[22:23:58] I want labweb1001 and 1002 to be pooled, and cloudweb1003 and 1004 to be depooled
[22:24:44] does anything on the server itself need to know what's pooled? I thought that was strictly an lvs thing
[22:24:50] (maybe that's what you meant)
[22:25:26] the db config is on etcd and mw looks it up to see what dbs to connect to
[22:25:47] I'm not talking about labweb being pooled or not
[22:26:41] oooh I see
[22:27:13] so, let's see, how can I ask a labweb what dbs it sees...
[22:28:08] eval.php and look up $wgLBFactoryConf
[22:29:17] it can also be that it doesn't have grants to read heartbeat
[22:29:34] (that gives the replag to mw)
[22:31:52] wgLBFactoryConf is enormous so I don't totally know what I'm looking for
[22:31:55] but I do see [db1098:3316] => 10.64.16.83:3316 in there
[22:32:38] that's not important. get ['sectionLoads']['s6']
[22:33:30] s6 looks right to me
[22:33:32] https://www.irccloud.com/pastebin/MnZT61HZ/
[22:34:19] so that's not the reason. It's probably heartbeat grants
[22:34:28] did you give the heartbeat grants?
[22:35:52] These hosts have been working for years without problem so... probably?
[22:36:00] Unless something changed elsewhere
[22:37:14] I thought these were new
[22:38:44] Nope, the new servers are depooled because I'm waiting for (I thought, you?) to create the grants. That's https://gerrit.wikimedia.org/r/c/operations/puppet/+/816026
[22:39:00] The errors are showing up on the old servers, the ones I hope to decom soon.
[22:40:37] It's possible this is related to the new servers since I pooled them and then depooled them. But I don't know why that would cause DB errors unless there's some kind of inter-app-server communication going on that I don't know about.
[22:41:43] as long as they have the same IP, they should be able to read heartbeat so I don't know
[22:41:59] I can look into it tomorrow
[22:42:39] ok. It's maybe worth ignoring until we have the new servers up; maybe moving the pieces around will make the error go away :)
[22:43:07] in the meantime, https://people.wikimedia.org/~ladsgroup/omg/ you can select the IP of the old labweb in target and filter out
[22:43:16] to see the grants
[22:43:30] so you'd know what to duplicate
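A quick way to test the heartbeat theory from the old labweb host, alongside the omg grant listing: try the same kind of read MediaWiki needs for lag, and list what the account is allowed to do. The target host, the 'wikiuser' account and the 'shard' column are assumptions here - the real credentials come from the MediaWiki config on the host, and the heartbeat table layout is whatever the WMF pt-heartbeat setup writes.

    # placeholder replica; run from labweb1001 with the application credentials
    mysql -h db1xxx.eqiad.wmnet -u wikiuser -p -e \
        "SELECT ts FROM heartbeat.heartbeat WHERE shard = 's6' ORDER BY ts DESC LIMIT 1;"
    # and to see the grants the account actually has on that host:
    mysql -h db1xxx.eqiad.wmnet -u wikiuser -p -e "SHOW GRANTS;"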