[00:01:46] <jinxer-wm>	 (DatasourceError) firing: Queue (Jenkins jobs + Zuul functions) alert - https://grafana.wikimedia.org/alerting/grafana/iS0FSjJ4z/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:07:01] <jinxer-wm>	 (DatasourceError) resolved: Queue (Jenkins jobs + Zuul functions) alert - https://grafana.wikimedia.org/alerting/grafana/iS0FSjJ4z/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:46:20] <wikibugs>	 (03CR) 10Jforrester: zuul: [mediawiki/extensions/PlaceNewSection] Enable CI (031 comment) [integration/config] - 10https://gerrit.wikimedia.org/r/1001360 (owner: 10Majavah)
[09:07:46] <hashar>	 taavi: hello, can you check my reply on https://gerrit.wikimedia.org/r/c/labs/tools/train-blockers/+/978139  ? I don't get why `Access-Control-Allow-Methods: GET` would be required there
[09:09:39] <taavi>	 hashar: morning!
[09:09:48] <taavi>	 I don't recall what I had in my mind either then, so merging
[09:09:48] <hashar>	 :)
[09:09:53] <hashar>	 hehe :)
[09:10:16] <hashar>	 I wasn't sure what I have missed and that looked over specific
[09:10:17] <hashar>	 thanks !
[09:11:59] <taavi>	 aand it's live
[09:12:14] <hashar>	 taavi: and while you are around I have one for ircservserv (which I have no idea what it is for) https://gerrit.wikimedia.org/r/c/wikimedia/irc/ircservserv-config/+/967131
[09:12:21] <hashar>	 you are apparently listed as maintainer (with kunal)
[09:12:44] <hashar>	 that looks like something to manage access lists of IRC channels
[09:14:29] <taavi>	 ok, that has one minor issue which I commented
[09:36:10] <hashar>	 taavi: done :)
[09:45:08] <taavi>	 !issync
[09:45:08] <ircservserv-wm>	 Syncing #wikimedia-releng (requested by Majavah)
[09:45:10] <ircservserv-wm>	 Set /cs flags #wikimedia-releng twentyafterfour -Vv
[09:45:12] <ircservserv-wm>	 Set /cs flags #wikimedia-releng dduvall +Vv
[09:45:14] <ircservserv-wm>	 Set /cs flags #wikimedia-releng thcipriani +FRVes
[09:45:16] <ircservserv-wm>	 Set /cs flags #wikimedia-releng andre +Vv
[09:45:18] <ircservserv-wm>	 Set /cs flags #wikimedia-releng jnuche +Vv
[09:45:20] <ircservserv-wm>	 Set /cs flags #wikimedia-releng marxarelli -Vv
[09:45:22] <ircservserv-wm>	 Set /cs flags #wikimedia-releng greg-g -AFRefiorstv
[09:47:44] <hashar>	 taavi: thank you for the merges and deployments :-]
[09:51:26] <TheresNoTime>	 If you'd like something else to fox Ta/avi I have the perfect thing (it rhymes with "ceta buster") /j
[09:51:33] <TheresNoTime>	 s/fox/fix
[09:52:42] <taavi>	 lol
[09:57:02] <TheresNoTime>	 (but joking aside if anyone has any ideas for T358329... all I'm good for here is looking at it and going "hm, yes that's broken")
[09:57:03] <stashbot>	 T358329: beta-update-databases-eqiad job times out - https://phabricator.wikimedia.org/T358329
[09:57:19] <hashar>	 oh no
[09:57:44] <hashar>	 TheresNoTime: some time it is because an extension has some migration script going on which is triggered by update.php
[09:58:01] <taavi>	 TheresNoTime: as I said on friday, the symptoms smell like a database replication issue. but I have not looked any further nor am I going to
[09:58:40] <TheresNoTime>	 would you do it for a hand drawn barnstar?
[09:58:47] <TheresNoTime>	 one of a kind
[09:58:52] * hashar giggles
[09:59:40] <TheresNoTime>	 Or alternatively is there a guide somewhere on how to resolve replication issues?
[10:01:42] <hashar>	 I'd ask #wikimedia-data-persistence
[10:03:51] <hashar>	 and there is no lag based on `foreachwiki maintenance/getLagTimes.php
[10:05:55] <TheresNoTime>	 hashar: problem is, I don't know enough about database/replication to know what to *ask* (:
[10:07:13] <Lucas_WMDE>	 there are a bunch of stuck postmerge changes in Zuul waiting for queued mwext-codehealth-master-non-voting jobs to start btw
[10:09:25] <TheresNoTime>	 hashar: are you trying to run update.php manually? is it progressing or getting stuck?
[10:09:35] <hashar>	 yeah I did
[10:09:40] <hashar>	 it is stuck :)
[10:10:29] <TheresNoTime>	 afaics there's no obvious error logs on the db hosts 
[10:10:47] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Release, 10Train Deployments: 1.42.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T354438#9575737 (10ovasileva)
[10:11:13] <hashar>	 we would want some way to start it in full debug mode
[10:12:08] <hashar>	 but I don't think we have the logic to force enable debug 
[10:19:16] <TheresNoTime>	 <joke>now just theoretically, if I shut down the beta cluster, do you think we'd get more eyes on the issue...?</joke>
[10:20:53] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9575756 (10TheresNoTime)
[10:21:43] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9571610 (10TheresNoTime) p:05Triage→03Unbreak! This seems to be causing cascade failures (edits not being saved/user preferences/etc.) so making UBN
[10:22:19] <TheresNoTime>	 "I'm Doing My Part!" .gif
[10:25:57] <hashar>	 OH MY GOD
[10:26:07] <hashar>	 everything is sooooo broken everywhere
[10:26:17] <hashar>	 so
[10:26:39] <hashar>	 I am going to update the phab ticket for documentation purpose
[10:26:52] <hashar>	 I will then pretend I haven't seen anything
[10:34:03] <hashar>	 TheresNoTime: /home/hashar/T358329.log
[10:37:12] <hashar>	 awaitSessionPrimaryPos: waiting for replica DB deployment-db12 to catch up...
[10:37:33] <hashar>	 and I have NO IDEA how it waits
[10:37:45] <hashar>	 maybe it requeries every 60 seconds
[10:38:14] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9575817 (10hashar) It is hard to tell what is broken really. I have checked with `sql aawiki` and `sql --write aawiki` to check the `SHOW PROCESSLIST;` output bu...
[10:39:24] <hashar>	 and of course neither `maintenance/lag.php` nor `maintenance/getLagTimes.php` report any issue
[10:39:43] <TheresNoTime>	 aaaaa
[10:40:15] <hashar>	 with the comment I wrote, there is probably a good week of work for a couple people :]
[10:49:51] <hashar>	                 Last_SQL_Error: Query caused different errors on master and slave.     Error on master: message (format)='Cannot load from %s.%s. The table is probably corrupted' error code=1728 ; Error on slave: actual message='no error', error code=0. Default database: 'test2wiki'. Query: 'drop database test2wiki'
[10:55:28] <hashar>	 yeah I think that is all I have 
[10:55:37] <hashar>	 given I don't know anything about setting up or fixing replication
[10:55:47] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9575869 (10hashar) On deployment-db12 I have connected to mysql, went with `show slave status \G` in a very wide terminal to prevent message from being truncated...
[10:55:55] <hashar>	 and that could be a misleading error message
[10:57:44] <hashar>	 I'd involve #wikimedia-data-persistence
[10:57:52] * hashar lunches
[11:09:53] <wikibugs>	 10GitLab (Integrations), 10Phabricator, 10Release-Engineering-Team (Priority Backlog 📥), 10collaboration-services: Get GitLab to render `T{\d}+` in MR overviews, comments, etc. as links to Phabricator - https://phabricator.wikimedia.org/T337570#9575901 (10kostajh) >>! In T337570#9569590, @dduvall wrote: >...
[11:13:51] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9575922 (10pmiazga) @hashar yes, I dropped the database as `addWiki.php` script created a partially created schema (script failed in the middle of run), re-runni...
[11:18:58] <pmiazga>	 I tried to run `update.php` on Friday - it looked like it’s slowly processing, but to run 4 migrations it required at least 15 mins. Therefore I don’t know if those migrations were executed or just time outed and update.php didn’t report it
[11:19:53] <taavi>	 pmiazga: how did you drop the `test2wiki` database?
[11:25:10] <pmiazga>	 One sec
[11:26:43] <pmiazga>	 I ssh to db-11 as I couldn’t access db-9, then I found the db there then did `sudo su -` and `mysql` per db808 recommendation.
[11:27:05] <pmiazga>	 And then `drop …` and it worked
[12:03:01] <pmiazga>	 Did we ping anyone from data team?
[12:12:38] <pmiazga>	 hashar: tav: TheresNoTime: any ideas what could we do next? Maybe I can throw some extra logging at update.php script but I don’t know if it is going to tell us more than what we already know
[12:12:49] <pmiazga>	 taavi: ^
[12:14:31] <TheresNoTime>	 well ta/avi initially suggested it was a replication issue, and hash/ar's comment seems to suggest that too.. it might be worth someone who knows how to check & resolve that taking a look 
[12:19:58] <pmiazga>	 Ok, I’ll find someone
[12:52:08] <hashar>	 pmiazga: I don't know much about replication unfortunately. The log I found suggest the replication got broken after `drop database test2wiki` got issued on the master
[12:52:31] <hashar>	 which somehow broke on master but did not on slave
[12:52:40] <hashar>	 so replication bails out cause it think things are out of sync
[12:52:42] <hashar>	 ?
[12:52:50] <hashar>	 I have no idea, I am just making wild assumptions
[12:53:46] <pmiazga>	 Could be
[12:54:17] <pmiazga>	 Also, I don’t know how/when it does the replication - but it’s worth mentioning that `test2wiki` was dropped at least 3 times
[12:55:34] <pmiazga>	 Sorry,  it was dropped two times, first time when it failed on Math extension, then second time after addwiki failed on Linker extension
[12:56:06] <pmiazga>	 On third time it worked, but the failed on inserting Main_Page to newly created wiki, so most likely I will need to drop the db again ;/
[13:03:24] <hashar>	 there is also a long tail of things being broken
[13:03:30] <hashar>	 such as update.php not showing it is waiting for replication 
[13:03:57] <hashar>	 or maintenance/lag.php and maintenance/getLagTimes.php not showing anything
[13:04:08] <hashar>	 but maybe I misunderstand what those scripts are intended for
[13:08:30] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9576181 (10TheresNoTime) ` Feb 26 13:06:45 deployment-db13 mysqld[619520]: 2024-02-26 13:06:45 9 [Warning] Aborted connection 9 to db: 'unconnected' user: 'unaut...
[13:15:05] <hashar>	 pmiazga: we can try restarting the replication and see what happens ? ;)
[13:15:28] <TheresNoTime>	 it failed
[13:15:42] <hashar>	 nice
[13:15:54] <hashar>	 and I also don't understand why MediaWiki doesn't see them lagged :/
[13:16:32] <TheresNoTime>	 I'm looking at giving https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases a go, just making sure I understand it first
[13:16:53] <TheresNoTime>	 (but restoring onto the pre-existing replicas, not creating new ones)
[13:18:53] <apergos>	 before you do that I would check with a dba; the amount of memory wanted might be different, there are meant to be two replicas and one primary, etc. 
[13:19:07] <hashar>	 ah
[13:19:08] <hashar>	 https://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=
[13:19:27] <hashar>	 gives me the same "false", which maybe is that it indicates the replication is stopped hehe
[13:19:44] <apergos>	 (also you'll need to fix up some wmf config settings with the new stuff_
[13:19:46] <pmiazga>	 Hello arnaudb. Hashar ThereNoTime: Arnaud said he can help
[13:19:58] <apergos>	 well db13 mariadb is shut down for whatever reason
[13:20:00] <wikibugs>	 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, 10serviceops: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9576217 (10Clement_Goubert)
[13:20:08] <wikibugs>	 10Release-Engineering-Team, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507#9576216 (10Clement_Goubert) 05Open→03Resolved
[13:20:16] <apergos>	 note I am only looking around a little, not touching / changing anything
[13:20:24] <arnaudb>	 from where is the data replicated? is it linked to S7 issue of saturday?
[13:20:26] <TheresNoTime>	 apergos: that was me, prior to doing any restoring, have not started (and ack your "check first")
[13:20:38] <apergos>	 the primary is db11, as seen in wmf-config/db-labs.php
[13:20:42] <pmiazga>	 arnaudb: the problem is that on db11 I did a `drop database test2wiki`
[13:21:11] <pmiazga>	 And then something went south with replication on other beta cluster wikis
[13:21:48] <TheresNoTime>	 I am no longer touching things if a DBA is around and wants to look. Restarting gives https://phabricator.wikimedia.org/T358329#9576181, so I think that's a primary point to look at
[13:21:59] <arnaudb>	 checking if I have enough info, will dig if not
[13:22:11] <apergos>	 I mean a few options are: actually create the db test2wiki on db12 and then see if replication can be restarted. or: see if replication can be started skipping that one statement in the bin log (see a dba for that). or...?  probably other things
[13:22:39] <pmiazga>	 I can create new db on db12
[13:22:48] <pmiazga>	 but if I remember right, those db’s are in read-only mode
[13:22:59] <pmiazga>	 So probably I need to lift the flag, do the query, get it back to read-only
[13:23:13] <apergos>	 I would do nothing until a dba say s something :-D
[13:23:44] <apergos>	 the same will hold true for db13 (it also is missing the db, I don't know if it reached that point in the bin log or not)
[13:24:15] <pmiazga>	 I checked the db12 - there is no `test2wiki` db there
[13:24:19] <hashar>	 fun times, if a slave is not replicating, its lag time is `false`  but maintenance/getLagtimes.php cast it to an integer with `intval( false )`  which yields `0`. The same as if there is no lag ehehe
[13:24:30] <apergos>	 that's right, it is not there pmiazga. it is also not there on db13. 
[13:27:11] <apergos>	 arnaudb:  it is unrelated to production; this is the deployment-prep db instances only that we are looking at 
[13:27:55] <arnaudb>	 I figured, having never interacted with those hosts :D
[13:28:01] <apergos>	 :-D
[13:28:42] <hashar>	 would starting the replication again on the master cause the whole `test2wiki` to be replicated "from scratch"?
[13:28:47] <arnaudb>	 so, if I rephrase your issue → you dropped a database but the cluster ended up in an inconsistent state, right?
[13:28:48] <hashar>	 I know nothing about how that works
[13:28:50] <apergos>	 https://mariadb.com/kb/en/set-global-sql_slave_skip_counter/  this is what I meant about skipping the problem statement in the bin log.  but I  don't know if that's the best move.  still, linking it here just for our info
[13:28:55] <pmiazga>	 If I can say something - when I started interacting with beta cluster - someone told me - don’t worry, you can’t break what’s already broken
[13:29:17] <arnaudb>	 you could but it can also trigger some issue with your replication consistency
[13:29:18] <pmiazga>	 arnaudb: most likely yes, after `drop database` it stopped working
[13:29:41] <arnaudb>	 and other dbs are still r/w available?
[13:29:54] <pmiazga>	 Also - I don’t care about `test2wiki` at all - this is a newly created db for new test wiki on beta cluster
[13:30:07] <pmiazga>	 We can drop it and I can re-run the script later once everything is back to normal
[13:30:18] <pmiazga>	 Right now it has no data, just schema -
[13:30:39] <pmiazga>	 Yes, looks like other dbs are available - but somehow writes do not go trough
[13:30:40] <wikibugs>	 10Release-Engineering-Team, 10MW-on-K8s, 10SRE, 10Traffic, 10serviceops: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507#9576241 (10Clement_Goubert)
[13:30:50] <arnaudb>	 hum I've never experienced this
[13:31:04] <pmiazga>	 also the `update.php` script stopped working - it takes ages to go trough migrations, looks like it waits for replication to sync
[13:31:06] <arnaudb>	 without having any hand on this cluster it's tough to help tbh!
[13:31:07] <wikibugs>	 10Release-Engineering-Team, 10MW-on-K8s, 10SRE, 10Traffic, 10serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508#9576243 (10Clement_Goubert) 05Stalled→03In progress
[13:31:14] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Scap, 10MW-on-K8s, 10SRE, 10serviceops: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402#9576244 (10Clement_Goubert)
[13:31:19] <wikibugs>	 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, 10serviceops: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9576245 (10Clement_Goubert)
[13:32:06] <pmiazga>	 Arnaudb - in short -> I executed `addwiki.php` script that failed in the middle due to missing/bad migration files. The `test2wiki` schema was partially created, I couldn’t run the script again because then it was failing with `table already exists`. And we went with idea - lets drop broken db and recreate it from scratch
[13:32:07] <apergos>	 I should be able to add you to the project as an admin if you are not there already, arnaudb
[13:33:57] <apergos>	 shall I?
[13:34:00] <arnaudb>	 sure!
[13:34:52] <TheresNoTime>	 would a possible option be to follow https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases#On_Existing_DB onwards (i.e. backup from the working master, to a (stopped) pre-existing replica, then `--apply-log` to that replica and restart the db/replication)?
[13:35:41] <TheresNoTime>	 (last time a similar-ish thing happened, iirc we just created new replica hosts entirely from scratch)
[13:35:48] <apergos>	 I have added you, arnaudb, please verify that you have access to the dbs and can sudo -s
[13:36:19] <pmiazga>	 BTW, TheresNoTime attached the log from replication
[13:36:22] <pmiazga>	 > Feb 26 13:06:45 deployment-db13 mysqld[619520]: 2024-02-26 13:06:45 6 [Warning] Slave: Can't drop database 'test2wiki'; database doesn't exist Error_code: 1008
[13:36:25] <arnaudb>	 I'm sorry apergos how can I access them ? I've never went there :D
[13:36:36] <pmiazga>	 so slave tries to drop the database
[13:36:40] <apergos>	 you need ssh keys for wmcs instances
[13:36:46] <arnaudb>	 I have
[13:36:56] <pmiazga>	 Maybe we could add the `test2wiki` empty db on both slaves, and then restart replication ?
[13:37:06] <arnaudb>	 pmiazga: is the query still visible in show processlist?
[13:37:09] <apergos>	 then try ssh deployment-db11.deployment-prep.eqiad1.wikimedia.cloud
[13:37:12] <apergos>	 and so on
[13:37:32] <pmiazga>	 arnaudb: not sure - this is what TheresNoTime got https://phabricator.wikimedia.org/T358329#9576181
[13:37:37] <apergos>	 that presumes your ssh config is set up to use the right agent for wikimedia.clouf
[13:37:39] <apergos>	 cloud
[13:38:08] <arnaudb>	 everything's setup yes :)=
[13:38:10] <arnaudb>	 thanks
[13:38:43] <apergos>	 if you're in, good. if not, you might need to go via restricted.bastion.wmflabs.org  (or maybe you have that set already) for any labs instance
[13:38:57] <arnaudb>	 I'm on it, thanks
[13:39:11] <arnaudb>	 will check and will notify if I find anything
[13:39:16] <apergos>	 ah I see you are there. getting off, and thanks
[13:39:23] <arnaudb>	 keep an eye on the chan, feel free to ping me :)
[13:39:30] <pmiazga>	 I got information from marostegui
[13:39:33] <pmiazga>	 > 2:36 PM <marostegui> pmiazga: I am out today but...we do not maintain BETA cluster. I don't even have access to it. And quickly reading the first ticket, it looks like to me that someone dropped the database on the slave, you can try to see if executing this on the slave will work: set session sql_log_bin=0; create database test2wiki; stop slave; start slave;
[13:39:38] <arnaudb>	 like a message from heaven
[13:40:08] <pmiazga>	 > 2:38 PM <marostegui> pmiazga: If you did it 3 times on the slave (https://phabricator.wikimedia.org/T358329#9575922) you will need that 3 times as soon as replication keep breaking
[13:40:18] <pmiazga>	 The only difference is that I did it on master
[13:40:37] <pmiazga>	 But still, maybe this is the thing, -0 arnaudb apergos - what do you think ?
[13:40:39] <arnaudb>	 how many slaves do you have in this cluster?
[13:40:43] <apergos>	 that's a big difference 
[13:40:47] <apergos>	 2 replicas
[13:41:04] <pmiazga>	 Should we try that? Get to db12, create test2wiki and stop/start slave ?
[13:41:10] <apergos>	 https://noc.wikimedia.org/conf/highlight.php?file=db-labs.php
[13:41:12] <pmiazga>	 The only thing I’m worried is that `sql_log_bin=0`,
[13:41:12] <arnaudb>	 so I have an angle
[13:41:46] <apergos>	 I want to wait for our dba right  here in the channel to scope things out and see what they report back, tbh.  go ahead arnaudb
[13:42:02] <arnaudb>	 actually, I'm not sure of my angle so please check with your dba
[13:42:07] <apergos>	 :-D
[13:42:24] <pmiazga>	 Ok, I can wait -> just trying to help - as I feel guilty of this craziness
[13:42:45] <arnaudb>	 but basically what I would try that could end up breaking the cluster even more would be to drop the db with sql_log_bin=0 everywhere and then restart the replication
[13:42:59] <apergos>	 I know it's uncomfortable having things broken.  but if we are a little cautious, it's ok.  things can remain broken a little longer just to double check things, eh?
[13:43:15] <arnaudb>	 but → I'm nothing more than a SRE gravitating around DBAs so I'm only so far in my jedi learning :P
[13:43:34] <pmiazga>	 I’m happy that I can provide some learning opportunities ;)
[13:44:08] <pmiazga>	 Sure, at this stage beta cluster is broken but so far I don’t know if this is breaking anyone workflow, probably could wait till tomorrow
[13:44:28] <TheresNoTime>	 arnaudb: It's Only Beta(tm), but in seriousness providing such an action would at the very least leave the master in a valid state to build new replicas from, it is maybe worth trying
[13:44:31] <apergos>	 it can at least wait 30 mins while we poke around
[13:44:37] <pmiazga>	 as it’s already broken from Thursday
[13:45:05] <arnaudb>	 TheresNoTime: I can't take the responsability, if you feel confident enough (or able to restore your dbs xD), go ahead
[13:45:21] <pmiazga>	 Btw - can we do some kind of snapshot there?
[13:45:48] <apergos>	 arnaudb:  now I get to ask you questions: I thought sql_log_bin=0  turns off writing to the bin log? if so, would that not be something we just run on the primary? 
[13:46:03] <pmiazga>	 From what I heard from jcrespo - is that we don’t have backups for beta
[13:46:20] <arnaudb>	 yep but since there was no creation on your slave if you drop the db with replication off it'll avoid trying replicate that action
[13:48:58] <apergos>	 ok please ELI5:  we want to have replication off on the replicas ( STOP ALL REPLICAS on primary?)  then no more writing to bin log (sql_log_bin=0) then make sure that test2wiki is gone everywhere (primary and replicas), then sql_log_bin=1, then START ALL REPLICAS (on primary)?   is this a correct understanding of what you want to do?
[13:50:43] <arnaudb>	 sorry I expressed myself poorly, to me there was a global replication stop (i.e. on all replica server as well, with a basic "stop slave;", since it's a luxury you can afford, and then on the master you send the log bin disabling, drop the db, start replication on slaves, enable binlog, create db
[13:51:10] <apergos>	 ok
[13:51:26] <arnaudb>	 bug again, not a db "per se" so I'd be double checking that angle x)
[13:51:42] <pmiazga>	 Please keep in mind - I did the drop at least two times - so it is possible that If you fixed it for the first drop, it may break on the second drop
[13:51:52] <apergos>	 and we might have to do this a few times as replication might break more than once
[13:51:57] <pmiazga>	 So we would have to do the same thing twice to get it go trough the log
[13:52:03] <arnaudb>	 then rince & repeat the operation maybe as Manuel seemed to suggest pmiazga ?
[13:53:26] <pmiazga>	 Well, my understanding is that until we start doing some queries on master - we should be good with whatever we want to do on slave
[13:54:14] <pmiazga>	 I mean, the problem is on slave, that it tries to drop the db that doesn’t exist - worst case scenario we break the slave entirely (but it’s already broken) and we would have to create it from scratch - but if we don’t touch master -> then hopefully it should work.
[13:54:30] <pmiazga>	 Or at least the newly created slave would take the entire copy of master
[13:54:42] <pmiazga>	 created -> recreated
[13:54:48] <apergos>	 we are going to touch the primary. we will be dropping that db from there. maybe more than once. 
[13:55:15] <arnaudb>	 or, different angle, you can promote a replica to become primary?
[13:55:21] <pmiazga>	 So the db on primary (db11) was dropped manually
[13:55:26] <arnaudb>	 and then just clone from the new primary
[13:55:28] <pmiazga>	 I didn’t touch db12 or db13
[13:55:31] <apergos>	 we might lose whatever is not in the bin logs from the weekend
[13:55:35] <apergos>	 so I would be reluctant
[13:56:03] <wikibugs>	 10Beta-Cluster-Infrastructure, 10MediaWiki-Platform-Team, 10MW-1.42-notes (1.42.0-wmf.20; 2024-02-27): Cannot create a new wiki on beta cluster - https://phabricator.wikimedia.org/T358236#9576344 (10kostajh) >>! In T358236#9569319, @pmiazga wrote: > After all fixes now script still fails, but this time on so...
[13:57:05] <apergos>	 *whatever is in the bin logs (not replicated).  sorry
[13:58:07] <TheresNoTime>	 I'm not sure data loss from the weekend on the beta cluster is a significant concern
[13:58:34] <apergos>	 I don't really know, because I only use it for my own little tests (for which that is not an issue)
[13:59:15] <wikibugs>	 10Beta-Cluster-Infrastructure, 10MediaWiki-Platform-Team, 10MW-1.42-notes (1.42.0-wmf.20; 2024-02-27): Cannot create a new wiki on beta cluster - https://phabricator.wikimedia.org/T358236#9576369 (10pmiazga) @kostajh nope, we didn't do `--skip-clusters`. On first run it failed with missing migration, we fixe...
[14:04:06] <TheresNoTime>	 have we made any decisions on things to try? :)
[14:05:33] <pmiazga>	 I’m tempted to create the db on db12 (first slave) and see what it brings, I wouldn’t touch db11 (master) and db13 (second slave). But I’m not DBA and honestly I have no idea if this is not going to break things even more. Therefore maybe it’s better to wait till tomorrow
[14:06:35] <apergos>	 ok so the options seem to be, realistically:  clone a replica and promote it; clone the primary and make that a new replica and then make a second replica from that (if we know both db12 and 13 are problems); make sure replication is off to both db12,13 and create the database there, then enable replication (might need to repeat this a few times); skip the problem statement in the bin log and re-enable replication for db12 and 
[14:06:35] <apergos>	 see what happens (might need to repeat on db13, might need to repeat a few times too)
[14:07:37] <apergos>	 is that a fair summary?
[14:09:46] <TheresNoTime>	 sounds it to *me*, and given the 3rd/4th option doesn't appear to dramatically affect the master its worth trying prior to any creations of new replicas
[14:10:10] <apergos>	 arnaudb, pmiazga, thoughts?
[14:11:51] <pmiazga>	 I wouldn’t touch db13 for now, I would make sure we can restore the db12
[14:12:03] <arnaudb>	 apergos: the primary node promotion seems to be a good angle afaict 
[14:12:28] <pmiazga>	 And then once db12 gets up and running -> we could re-do the same steps for db13 -> because if breaking something even more on db12 -> the db13 would be still in the “half-broken” state
[14:13:45] <TheresNoTime>	 (plus could always drop to a single replica if db12 gets fixed, and then create a new replica from db12)
[14:16:21] <TheresNoTime>	 so "make sure replication is off to db12 and create the database there, then enable replication"? Is that the same as the recommended `set session sql_log_bin=0; create database test2wiki; stop slave; start slave;` ?
[14:16:42] <pmiazga>	 yes
[14:17:08] <pmiazga>	 But I wonder if we need to do `set global read_only = false;` before
[14:17:34] <pmiazga>	 As it may not allow to create the db there. So set `read_only` to false, do actions, then set the read_only to true, and then start slave
[14:18:30] <TheresNoTime>	 arnaudb: apergos: agree? if so, who wants to run that on db12 (and will they need to do ^ ref. read_only)?
[14:18:47] <apergos>	 stop slave everywhere. just to be sure.  then turn off writes to the binlog on the primary.  just for safety. then enable writes on db12. then  create the database on db12. then turn off writes on db12.  then enable writes to the binlog on the primary. then start replication to db12.        I think. ( arnaudb ?)  
[14:21:45] <arnaudb>	 I'm more favorable to the role (replica → primary) swap in this context, will have to leave you for a meeting, ttys
[14:22:58] <apergos>	 thank you for your looking into it and your advice!
[14:23:17] <TheresNoTime>	 in which case, following https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases to bring up a new replica from scratch seems safest (as we can't trust the state of the two replicas, correct?)
[14:25:15] <apergos>	 arnau db was saying I think to promote an existing replica to primary (clone and promote).   I don't know if we can trust the state of anything, but we have to choose one, primary or replica! 
[14:25:39] <pmiazga>	 I think we have `master -> slave + slave`, not multi master replication - therefore I assume that db11 has mostly up to date as it was taking all writes
[14:26:15] <pmiazga>	 But those weren’t replicated to db12, db13 -> and that’s why we have multiple phab tickets with people saying that writes do not work (because we use slaves for reading, and this data wasn’t replicated)
[14:26:48] <pmiazga>	 If we want to promote existing slave to master probably we should do a mysqldump of primary - just in case
[14:27:37] <apergos>	 yeah I would just looking into space
[14:27:44] <apergos>	 not quite enough for it with any margin
[14:27:46] <TheresNoTime>	 db11 (the primary) is the only host with current data, the two replicas are in inconsistent states 
[14:28:32] <apergos>	 well
[14:28:53] <apergos>	 clone primary into a replica and see if it runs -->  nothing gets broken, so I guess let's do that
[14:29:05] <apergos>	 if it goes awry we lost time but nothing else
[14:29:11] <pmiazga>	 Also, it’s already mid day - at this moment most likely we won’t get it working by the evening in the EU - if we aren’t sure the side effects of our actions - it shouldn’t be a problem to have it broken for little bit more.
[14:29:50] <pmiazga>	 Cloning primary into replica -> sounds good to me, it shouldn’t cause any problems
[14:30:23] <TheresNoTime>	 we've been saying clone, but afaik you can't actually clone VMs right..?
[14:30:32] <apergos>	 40 gigs of stuff, I dunno how long that takes to clone tbh  (plus setup of the new instance) 
[14:30:51] <pmiazga>	 Also, as I said - it’s already broken from Thursday EU evening, so far we didn’t get a massive complaints - couple phab tickets
[14:31:20] <apergos>	 yeah I am using the woord "clone" losely here, just meaning xtrabackup basically
[14:31:36] <TheresNoTime>	 or by clone do we mean https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases#Creating_new_instances ... ah right okay :)
[14:33:03] <apergos>	 i.e. set up new instance, use xtrabackup (aka innobackupex) to get data from primary to the new instance, I dunno how long that might take for 40 gb of stuff
[14:33:06] <TheresNoTime>	 So the summarise, let's start the process of bringing another replica up, to clone from the primary. I am going to start this process now (the initial setup of `deployment-db14`) unless anyone would rather do it
[14:33:39] <apergos>	 fine by me  (note in 30 min I have a meeting :-( )
[14:33:50] <apergos>	 no. in an hour!
[14:34:23] <TheresNoTime>	 ack, I am going to start, following the guide ^ 
[14:35:01] <TheresNoTime>	 !log deployment-prep, starting the creation of `deployment-db14` per https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases for T358329
[14:35:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:35:04] <stashbot>	 T358329: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329
[14:56:46] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9576546 (10ArielGlenn) Just to explain to folks who might be following along, what's happening: the primary server (db13) will be cloned (via mariabackup --innob...
[14:59:46] <TheresNoTime>	 !log deployment-prep, `root@deployment-db14:~# /opt/wmf-mariadb106/scripts/mysql_install_db --user=mysql --basedir=/opt/wmf-mariadb106 --datadir=/srv/sqldata` T358329
[14:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[14:59:50] <stashbot>	 T358329: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329
[15:00:33] <roy649>	  https://phabricator.wikimedia.org/T336504 is causing widespread problems.  Is there any chance last Thursday's enwiki deployment could be rolled back as a stop-gap measure until a real fix can be deployed?
[15:01:59] <TheresNoTime>	 apergos: I am at step https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases#On_Existing_DB (to be run on db11, the primary) — should I set this as readonly/stop mysqld first or does it not matter that it'll clone while transactions are ongoing as it'll then catch up?
[15:02:37] <TheresNoTime>	 (the guide does not mention readonly/stopping, so in absence of opinion I'll continue as-is)
[15:02:54] <apergos>	 um
[15:03:07] <apergos>	 (sorry, I was talking about this bug on slack with piort)
[15:03:27] <apergos>	 I would set readonly and stop all replication just to be very sure
[15:03:34] <TheresNoTime>	 ack
[15:03:41] <apergos>	 *piotr
[15:04:03] <taavi>	 thcipriani, jeena, brennen: ^ for roy649's question
[15:05:14] <TheresNoTime>	 !log deployment-prep, db11, `root@BETA[(none)]> set global read_only = true;` T358329
[15:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:05:17] <stashbot>	 T358329: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329
[15:05:53] <roy649>	 There have been 6 different threads opened on the village pump about this.  All currently consolidated under https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Weird_issue_where_Vector_2022_is_being_forced_on_a_single_page.
[15:08:10] <hashar>	 sandeep: success
[15:09:45] <TheresNoTime>	 !log deployment-prep `root@deployment-db11:~# mariabackup --innobackupex --stream=xbstream /srv/sqldata --user=root --host=127.0.0.1 --slave-info | nc deployment-db14 9210` T358329
[15:09:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:10:35] <TheresNoTime>	 !log deployment-prep prev. command resulted in `2024-02-26 15:06:43 0 [ERROR] InnoDB: Operating system error number 24 in a file operation.` T358329
[15:10:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:10:38] <stashbot>	 T358329: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329
[15:11:39] <TheresNoTime>	 !log deployment-prep, db11, `root@BETA[(none)]> set global read_only = false;` while I figure out mariabackup error T358329
[15:11:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:14:36] <pmiazga>	 roy649: it’s not a code change, cannot be reverted.
[15:14:57] <roy649>	 what was it, if not a code change?
[15:15:21] <roy649>	 Obviously something changed on Thursday
[15:16:34] <apergos>	 TheresNoTime:  try setting the ulimit to 2048 for open files, it is currently to 1024 afaict
[15:16:36] <roy649>	 I'm not trying to be a pain here.  I've got plenty of time in the trenches doing SRE/RelEng/etc, so I do feel your pain
[15:16:53] <TheresNoTime>	 apergos: I was going to do `--open-files-limit=1024` but can try that instead
[15:17:06] <apergos>	 I think you can edit /etc/security/limits.conf and add that value 
[15:17:08] <roy649>	 but whatever changed is causing a lot of problems for a lot of editors on enwiki
[15:17:15] <pmiazga>	 roy649: - We tried to add a new wiki on beta cluster -> due to broken SQL migrations in Math/Linter extension the DB schema was partially created. We fixed the `addwiki.php` script but because database was corrupted the easiest way to fix the wiki was to drop the newly created database for test2wiki. But that corrupted the replication process.
[15:17:16] <apergos>	 then you should make sure the innodb backend also has that new limit
[15:17:35] <TheresNoTime>	 pmiazga: I think you may be talking cross-issues
[15:17:45] <thcipriani>	 pmiazga: I think I think you might be talking about different issues ^ yeah :)
[15:17:54] <pmiazga>	 But the the main question roy649 - where are the issues ->
[15:17:57] <TheresNoTime>	 T336504 is what roy649 is discussing afaik :)
[15:17:58] <stashbot>	 T336504: Transcluding Special:Prefixindex can force the default skin - https://phabricator.wikimedia.org/T336504
[15:18:00] <pmiazga>	 Because we broke only betacluster
[15:18:02] <apergos>	 roy649:   we're working on a cloud db issue, not the production issue,  sorry! 
[15:18:27] <apergos>	 (and we need to fix the cloud db issue so we can get testing over there working again)
[15:18:38] <thcipriani>	 roy649: thank you for the report, it does seem odd that a bug that's been around since late 2021 only started causing issues for users last Thursday (which is how I read the bug attached to the village pump discussion). Lemme see if I can get that more attention.
[15:19:16] <roy649>	 thcipriani thanks.  If you could update the phab ticket when you know more, that would be appreciated.
[15:19:42] <thcipriani>	 roy649: I'm going to try to have this disccussion on the ticket
[15:19:44] <apergos>	 open_files_limit  in the mysqld section of my.cnf will be how you get the open file limit raised for innodb, TheresNoTime
[15:19:53] <TheresNoTime>	 ack
[15:20:27] <roy649>	 And I certainly appreciate that there's other fires that also need putting out.  BTDT, glad I don't carry a pager any more :-)
[15:20:36] <apergos>	 I'll uh see if the people in my meeting don't mind if I am mostly in here trying to follow along/pitch in where needed, it's a triage meeting in any case (in 10 min)
[15:21:29] <taavi>	 thcipriani: iirc we've seen a few repeats of a same symptoms appearing due to a slightly different cause. so this one is almost certainly due to something in the last train
[15:22:12] <thcipriani>	 ^ jeena brennen when you're around, we might need to get this one figured out before we can do anything wmf.20 related
[15:22:56] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Release, 10Train Deployments: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437#9576648 (10thcipriani)
[15:23:19] <thcipriani>	 (might need to roll back on a Monday, which seems...unideal :\)
[15:23:38] <James_F>	 Indeed. :-(
[15:24:41] <pmiazga>	 TheresNoTime: anything I can help?
[15:25:22] <wikibugs>	 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: Edits not saved on beta cluster - https://phabricator.wikimedia.org/T358364#9576663 (10Jdforrester-WMF)
[15:25:36] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9576665 (10Jdforrester-WMF)
[15:27:06] <apergos>	 at risk of speaking for someone else (cough), I think it's ok for the moment, I'm in here too and the latest error was hitting the ulimit of too many open files, a very low limit by default as it turns out 
[15:27:12] <apergos>	 pmiazga: ^^
[15:28:05] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Release, 10Train Deployments: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437#9576670 (10thcipriani) 05Resolved→03Open Reopening until we figure out what to do with T358304
[15:28:39] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9576677 (10TheresNoTime) >>! In T358329#9576594, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-releng), href=https://sal.toolforge.org/log...
[15:29:00] <TheresNoTime>	 apergos: https://phabricator.wikimedia.org/T358329#9576677, I am at my limit of knowledge and don't want to guess 
[15:29:08] <apergos>	 looking
[15:29:50] <apergos>	 you increased the limit in my.cnf too, TheresNoTime?
[15:30:25] <TheresNoTime>	 apergos: no, wasn't listed in `/etc/my.cnf` and the error went away after the `ulimit` change
[15:30:29] <TheresNoTime>	 do you think that's related?
[15:30:51] <apergos>	 I don't know what the innodb default value is so I would add it and try again just to rule that out
[15:30:57] <James_F>	 I'd guess that the test2wiki binlogs are corrupt and won't apply.
[15:31:01] <apergos>	 or see what the default value is
[15:31:50] <pmiazga>	 I don’t know if we can skip the test2wiki entirely
[15:32:09] <pmiazga>	 If we cannot replicate it I’m fine, just need to drop it somehow
[15:32:11] <James_F>	 If we use root to drop the entire DB (again) on the primary?
[15:32:21] <TheresNoTime>	 can try that, one moment
[15:32:41] <James_F>	 `DROP DATABASE` considered harmful, who'd have thought?
[15:32:57] * pmiazga hides
[15:34:21] <TheresNoTime>	 apergos: dropping test2wiki lets it get past that error, but then back to open files error. Will do your suggestion (`open-files-limit = 4094` in /etc/my.cnf under [mysqld], then restart mysql?)
[15:34:56] <apergos>	 yes
[15:35:03] <TheresNoTime>	 !log deployment-prep db11 `root@BETA[(none)]> drop database test2wiki;`
[15:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:36:10] <TheresNoTime>	 !log deployment-prep db11 set `open-files-limit = 4094` in `/etc/my.cnf`, then did `root@deployment-db11:~# systemctl restart mysqld`
[15:36:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[15:37:17] <apergos>	 I don't know that this will fix the new errror but it will at least be one issue ruled out
[15:37:23] <apergos>	 then we'll just keep going
[15:37:49] <TheresNoTime>	 same error, shall I increase to something higher?
[15:38:30] <apergos>	 no
[15:38:37] <apergos>	 let's see what we might do next.
[15:39:48] <TheresNoTime>	 my mistake, I did try `40940` and it worked (no file open errors) but then `mbstream: Can't create/write to file './mysql/innodb_index_stats.ibd' (Errcode: 17 "File exists")`
[15:40:32] <TheresNoTime>	 (that error on db14 fwiw)
[15:42:03] <TheresNoTime>	 but up to that point, it *was* copying files correctly. Can delete `/srv/mysql/innodb_index_stats.ibd` on db14 and try again?
[15:42:10] <apergos>	 let me see
[15:42:16] <TheresNoTime>	 ack
[15:43:18] <apergos>	 let me get on db14 and just look around a bit ok?
[15:43:47] <TheresNoTime>	 sure thing, all yours
[15:45:27] * brennen reads scrollback
[15:45:41] <James_F>	 brennen: Beta Cluster problems.
[15:45:59] <James_F>	 brennen: And, separately, a prod preferences issue from last week.
[15:46:21] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9576755 (10TheresNoTime) >>! In T358329#9576677, @TheresNoTime wrote: > I would now be clutching at straws to diagnose this, so for the next person who takes a l...
[15:48:03] <apergos>	 (sorry, trying to explain in the meeting why it's useful to get this fixed sooner rather than later)  
[15:48:05] <apergos>	 ok so 
[15:48:31] <apergos>	 mariabackup --innobackupex  etc is not runnig there now and no mariadb service either, right?
[15:48:57] <TheresNoTime>	 `mariabackup` is not running on db11 (primary), and mariadb is not running on db14
[15:50:05] <apergos>	 ok
[15:50:39] <apergos>	 on db14, mariadb should not be running
[15:51:11] <TheresNoTime>	 correct
[15:51:35] <apergos>	 ariel@deployment-db14:~$ ps axuww | grep maria
[15:51:35] <apergos>	 mysql      12108  0.0  2.5 13926800 418284 ?     Ssl  15:18   0:00 /opt/wmf-mariadb106/bin/mysqld
[15:51:44] <apergos>	 but it seems to be
[15:51:54] <TheresNoTime>	 oh.
[15:53:25] <apergos>	 may I stop it?
[15:53:28] <TheresNoTime>	 please
[15:53:47] <TheresNoTime>	 damn, I guess that would explain why it couldn't replace that file maybe
[15:56:08] <apergos>	 give me a few, systemctl not doing the job
[15:57:52] <apergos>	 ok, mysql and then giving the shutdown command directly worked
[15:58:05] <apergos>	 I noticed while I was in ther ethat it already had some databases
[15:58:16] <apergos>	 cawiki cognate_wiktionary  
[15:58:31] <apergos>	 so we might need to toss those, not sure yet
[15:58:39] <TheresNoTime>	 okay, shall I try again?
[15:59:49] <apergos>	 I'm tempted to say that you should toss everything in the sqldata directory, recreate the dir with the right owner, and reinstall via  /opt/wmf-mariadb106/scripts/mysql_install_db --user=mysql --basedir=/opt/wmf-mariadb106 --datadir=/srv/sqldata
[16:00:01] <apergos>	 then update all the open file limits over there again
[16:00:07] <apergos>	 and then try again.
[16:00:12] <TheresNoTime>	 ack, will do that now
[16:00:15] <apergos>	 start clean in other words
[16:00:17] <apergos>	 ok
[16:08:49] <TheresNoTime>	 apergos: same error, (copies until it gets to `/srv/sqldata/mysql/innodb_index_stats.ibd`) — maybe worth instead creating `/srv/mysqlbackup` or something on db14 and streaming everything to there, and then copying the tables?
[16:11:15] <apergos>	 so to recap: you started    nc -l -p 9210 | mbstream -x  in /srv/sqldata on db14,     started mariabackup --innobackupex --stream=xbstream .. on db11,  started mariabackup --innobackupex on db14, chown -R mysql: /srv on db14, started mariadb, and you saw errors where exactly? on db11 on the console, or...?
[16:12:57] <apergos>	 TheresNoTime: ^
[16:12:57] <TheresNoTime>	 https://www.irccloud.com/pastebin/mzm9pVcI/
[16:13:03] <apergos>	 ok looking
[16:13:27] <apergos>	 oh it's the nc on db14 that fails right away
[16:13:38] <apergos>	 so the very first step in that process
[16:13:43] <TheresNoTime>	 well, after a little while (there's a gap while it streams)
[16:13:55] <apergos>	 ok
[16:14:25] <TheresNoTime>	 https://phabricator.wikimedia.org/P57934 is the full log from db11
[16:15:03] <TheresNoTime>	 I think I
[16:15:36] <TheresNoTime>	 *I'm going to clone to a different directory, and then move the files over afterwards (as it's only complaining about files already existing, and they either need to be replaced or not)
[16:16:05] <wikibugs>	 10Scap, 10MW-on-K8s: scap sync-world: Incorrect behavior for mw-on-k8s deployment when --force flag is used - https://phabricator.wikimedia.org/T358500#9576888 (10dancy)
[16:17:12] <apergos>	 oh it's the specific file,  I see
[16:17:52] <TheresNoTime>	 syncing to `/srv/sqldatabackup` on db14, working fine (then when its finally done, someone can figure out how to merge that data into `/srv/sqldata`)
[16:19:34] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Scap, 10MW-on-K8s: scap sync-world: Incorrect behavior for mw-on-k8s deployment when --force flag is used - https://phabricator.wikimedia.org/T358500#9576962 (10dancy)
[16:19:37] <TheresNoTime>	 puppet is restarting mysqld on db14 isn't it (:
[16:19:45] <apergos>	 ah is that the issue
[16:19:50] <apergos>	 well that can be fixed
[16:20:08] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Scap, 10MW-on-K8s: scap sync-world: Incorrect behavior for mw-on-k8s deployment when --force flag is used - https://phabricator.wikimedia.org/T358500#9576963 (10jnuche) 05Open→03In progress a:03jnuche
[16:21:02] <apergos>	 puppet now disabled
[16:21:07] <apergos>	 I'll stop mysqld again
[16:21:29] <apergos>	 done
[16:21:34] <wikibugs>	 10Project-Admins: Create project tag for <Future of the Wishlist> - https://phabricator.wikimedia.org/T358505#9576966 (10SOzenogu-WMF)
[16:21:41] <apergos>	 and that would be why the file exists.... 
[16:21:45] <apergos>	 care to have one more go?
[16:22:27] <apergos>	 why does puppet do that, I thought new.... oh. this is not prod, so maybe puppet autostarting the server is desired behaviour...
[16:23:57] <apergos>	 TheresNoTime: 
[16:24:06] <TheresNoTime>	 ack
[16:24:24] <wikibugs>	 10Project-Admins: Create project tag for <Future of the Wishlist> - https://phabricator.wikimedia.org/T358505#9576993 (10SOzenogu-WMF) Hi, my name is Susan, and I'm supporting CommTech team as Technical Program Manager for the Future of the Wishlist Project. Please grant me access to create this Project Workboar...
[16:26:30] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Scap, 10MW-on-K8s, 10Patch-For-Review: scap sync-world: Incorrect behavior for mw-on-k8s deployment when --force flag is used - https://phabricator.wikimedia.org/T358500#9576997 (10CodeReviewBot) jnuche opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge...
[16:30:30] <brennen>	 re: prod preference issue on T336504, i'm available to do a rollback if needed, but would sure rather not.  as t.hcipriani mentioned i guess we try to sort that one out on the task.
[16:30:31] <stashbot>	 T336504: Transcluding Special:Prefixindex can force the default skin - https://phabricator.wikimedia.org/T336504
[16:30:51] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Scap, 10MW-on-K8s, 10Patch-For-Review: scap sync-world: Incorrect behavior for mw-on-k8s deployment when --force flag is used - https://phabricator.wikimedia.org/T358500#9577037 (10CodeReviewBot) jnuche merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge...
[16:31:06] <apergos>	 I see many files being copied which seems like a good thing
[16:31:34] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Scap, 10MW-on-K8s, 10Patch-For-Review: scap sync-world: Incorrect behavior for mw-on-k8s deployment when --force flag is used - https://phabricator.wikimedia.org/T358500#9577041 (10jnuche) 05In progress→03Resolved
[16:32:18] <hashar>	 TheresNoTime: I am going to rebase your patch to have lag.php to return a meaningful message https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1006530
[16:32:29] <hashar>	 cause I have a local patch covering that with some test cases :]
[16:32:37] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9577050 (10ArielGlenn) In the end @TheresNoTime figured it out: puppet was starting mariadb automatically when we didn't want it running and hence creating that...
[16:36:29] <TheresNoTime>	 okay, trying the next step, `mariabackup --innobackupex --apply-log --use-memory=10G /srv/sqldata`
[16:36:59] <TheresNoTime>	 noting this has the note "before starting mysqld"
[16:37:06] <apergos>	 yes
[16:37:21] <apergos>	 puppet still disabled, you should (cross fingers) still be ok as far as that goes
[16:37:30] <TheresNoTime>	 `[00] 2024-02-26 16:37:06 Last binlog file ./deployment-db11-bin.000088, position 222576711 completed OK!` :)
[16:37:40] <apergos>	 ok that's excellent
[16:37:47] <apergos>	 log things? :-)
[16:38:17] <TheresNoTime>	 !log deployment-prep, db11, `mariabackup --innobackupex --apply-log --use-memory=10G /srv/sqldata` T358329
[16:38:19] <wikibugs>	 10Continuous-Integration-Infrastructure, 10SRE, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577063 (10LSobanski) a:03Dzahn
[16:38:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:38:20] <stashbot>	 T358329: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329
[16:38:42] <TheresNoTime>	 I am now going to start mysqld
[16:38:58] <TheresNoTime>	 I don't think the systemd service was enabled, right?
[16:39:12] <apergos>	 I hm stopped it but did not disable
[16:39:13] <TheresNoTime>	 oh wait it was, not using the `mysqld`
[16:39:16] <TheresNoTime>	 *name
[16:39:17] <apergos>	 so I thin kit is not "enabled"
[16:39:29] <apergos>	 maybe
[16:39:34] <apergos>	 try this:
[16:39:56] <apergos>	 enable puppet, run it
[16:40:02] <apergos>	 check that mariadb (mysqld) is running
[16:40:05] <TheresNoTime>	 oh I just did root@deployment-db14:/srv/sqldata# systemctl status mariadb.service
[16:40:16] <TheresNoTime>	 er, not status, start
[16:40:26] <apergos>	 ok well that's fine too but let's make sure that uh puppet runs 
[16:40:28] <TheresNoTime>	 shows active (running)
[16:42:18] <apergos>	 gee I wonder which is the error log >_<
[16:43:21] <TheresNoTime>	 okay puppet re-enabled and runs ok
[16:43:34] <apergos>	 good
[16:43:41] <apergos>	 I'm uh stilll hunting around for the dang error log 
[16:43:45] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Release, 10Train Deployments: 1.42.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T354438#9438192 (10brennen)
[16:44:22] <TheresNoTime>	 apergos: everything seems to be running okay? 
[16:44:26] <apergos>	 good
[16:44:31] <apergos>	 I just want to double check etc
[16:44:39] <apergos>	 before replication gets started back up
[16:44:46] <TheresNoTime>	 ack
[16:45:10] <TheresNoTime>	 in fact, looking at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases#Starting_Replication, the repl password is *not* in the location listed
[16:45:19] <apergos>	 heh
[16:48:42] <apergos>	 Feb 26 16:39:26 deployment-db14 mysqld[15139]: 2024-02-26 16:39:26 0 [ERROR] Incorrect definition of table mysql.event: expected column 'definer' at position 3 to have type varchar(, found type char(141).
[16:48:51] <apergos>	 I see this and want to make sure it isn't a problem
[16:48:55] <apergos>	 I got it by
[16:49:02] <apergos>	 journalctl -n 20 -u mariadb.service
[16:49:06] <wikibugs>	 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9577115 (10Clement_Goubert)
[16:49:14] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Release, 10Train Deployments: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437#9577113 (10brennen) > Reopening until we figure out what to do with T358304  I marked {T336504} as a blocker for wmf.20 - guess that should be T336504 and her...
[16:49:57] <TheresNoTime>	 apergos: I don't even know what that means :/
[16:50:30] <wikibugs>	 10Release-Engineering-Team (Seen), 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#8728141 (10Clement_Goubert)
[16:50:42] <apergos>	 it means, have a coffee or a bit of tea and I'll poke around a bit
[16:51:17] <TheresNoTime>	 apergos: if you're happy to, I'm going to step away for a bit. If you get to the point where you want to enable replication, hopefully https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases#Starting_Replication helps
[16:52:10] <apergos>	 ok, we'll see if I get there :-D
[16:52:18] <apergos>	 take a breather! and thanks for eveyrthing
[16:53:35] <apergos>	 this error is something standard, present on db13 for example since Nov 17 so I can ignore
[16:54:31] <apergos>	 but. I will look into it a little anyways while replication is stopped.  maybe ten mins. then move on if there's not an obvious fixup
[16:54:45] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9577169 (10TheresNoTime) Stepping away for a bit, note for whomever, replication hasn't been started (@ArielGlenn is taking a look), and once confirmed working [...
[16:55:11] <wikibugs>	 10Continuous-Integration-Infrastructure, 10SRE, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577168 (10LSobanski) p:05Triage→03High
[16:56:16] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Release, 10Train Deployments: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437#9577185 (10brennen)
[16:57:09] <hashar>	 but why did I start looking at maintenance/lag.php and maintenance/getLagTimes.php ?
[16:57:12] <hashar>	 it is a rabbit hole
[16:59:17] <James_F>	 hashar: Flee! Before you can't.
[17:03:18] <apergos>	 fixed it: needed to run /top/wmf-mariadb106/bin/mariadb-upgrade  (with the --force option), going to document that now
[17:05:41] <TheresNoTime>	 Ah!
[17:07:17] <apergos>	 what's next, replication? 
[17:07:43] <apergos>	 let me see if I can find where the password is and update the location of that
[17:07:46] <apergos>	 in the doc.
[17:14:47] <apergos>	 so far, nada: that file apparently never existed according to git log --diff-filter=D --summary 
[17:14:56] <apergos>	 grepping that output for mysql: still nothing
[17:15:03] <apergos>	 no replica either
[17:20:39] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9577298 (10ArielGlenn) The cloning procueddure is done for db14 but we are currently hunting around for the replication password, not where the docs ( https://wi...
[17:25:36] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9577327 (10Ladsgroup) Hi 👋  I suggest stopping to touch this. I will take a look soon. Regarding databases, if you're not 100% sure what you're doing, you usuall...
[17:28:26] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9577350 (10ArielGlenn) >>! In T358329#9577327, @Ladsgroup wrote: > Hi 👋  I suggest stopping to touch this. I will take a look soon. Regarding databases, if you'r...
[17:29:25] <apergos>	 !log waiting on Amir's input as per his comment on T358329; db14 remains up but not replicating. 
[17:29:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[17:29:28] <stashbot>	 T358329: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329
[17:30:12] <apergos>	 TheresNoTime:  ^^ status update. 
[17:36:40] <TheresNoTime>	 Sounds fair!
[17:47:31] <wikibugs>	 10Continuous-Integration-Infrastructure, 10SRE, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577454 (10Dzahn) I'll go with private IP but cloud VPS doesn't really seem feasible to me.
[18:07:07] <wikibugs>	 10Continuous-Integration-Infrastructure, 10SRE, 10collaboration-services, 10vm-requests, 10Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint1003....
[18:07:31] <wikibugs>	 10Continuous-Integration-Infrastructure, 10SRE, 10collaboration-services, 10vm-requests, 10Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint1003.eqia...
[20:19:19] <wikibugs>	 10Continuous-Integration-Infrastructure, 10SRE, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint1004.eqiad.wmnet with OS bu...
[20:19:50] <wikibugs>	 10Project-Admins: Create project tag for Google Season of Docs 2024 - https://phabricator.wikimedia.org/T358522#9577973 (10apaskulin)
[20:20:17] <wikibugs>	 10Project-Admins: Create project tag for Google Season of Docs 2024 - https://phabricator.wikimedia.org/T358522#9577983 (10apaskulin)
[20:43:17] <wikibugs>	 10Project-Admins, 10Google Season of Docs 2024: Create project tag for Google Season of Docs 2024 - https://phabricator.wikimedia.org/T358522#9578065 (10Peachey88) 05Open→03Resolved a:03Peachey88 Created :)
[20:45:50] <wikibugs>	 10GitLab (Integrations), 10Phabricator, 10Release-Engineering-Team (Priority Backlog 📥), 10collaboration-services: Get GitLab to render `T{\d}+` in MR overviews, comments, etc. as links to Phabricator - https://phabricator.wikimedia.org/T337570#9578117 (10brennen) > We could also start using #<number> nota...
[20:58:06] <wikibugs>	 10Continuous-Integration-Infrastructure, 10SRE, 10collaboration-services, 10vm-requests, 10Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9578161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint1004.eqia...
[20:59:24] <wikibugs>	 10Project-Admins, 10Google Season of Docs 2024: Create project tag for Google Season of Docs 2024 - https://phabricator.wikimedia.org/T358522#9578162 (10apaskulin) Thanks!
[21:18:59] <Amir1>	 TheresNoTime: hi, what was the binlog file and log position when you started cloning?
[21:19:16] <Amir1>	 (and if you cloned, did you set the primary to read only?)
[21:23:30] <James_F>	 Amir1: Yes, primary is still read only.
[21:23:42] <Amir1>	 I just set it, it wasn't :P
[21:23:50] <James_F>	 It was at some point.
[21:24:00] <Amir1>	 now I need to reclone it 
[21:26:25] <wikibugs>	 10Release-Engineering-Team (Now this 🫠), 10Scap, 10serviceops-radar, 10Patch-For-Review, 10Python3-Porting: git-fat replacement/removal - https://phabricator.wikimedia.org/T279509#9578213 (10CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/226  deploy: Fix git-l...
[21:46:10] <wikibugs>	 10Beta-Cluster-Infrastructure: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329#9578286 (10Ladsgroup) db14 should be now getting replication well and working just fine. I'll write more details on why/what/etc. tomorrow (when I'm actually bac...
[21:50:05] <RhinosF1>	 Amir1: congrats
[21:50:16] <Amir1>	 thanks
[21:50:59] <James_F>	 Amir1: I guess db-labs.php needs to be updated to list deployment-db14 and drop deployment-db12 and deployment-db13, first?
[21:51:44] <Amir1>	 yup
[21:51:51] <Amir1>	 that's all, then things should be back to "normal"
[21:55:57] <wikibugs>	 10Project-Admins: Create two tags: #essential-work and #okr-work - https://phabricator.wikimedia.org/T357321#9578316 (10Jdforrester-WMF) 05Open→03Resolved a:03Jdforrester-WMF OK, no objections for two weeks, so done.
[22:00:57] <wikibugs>	 10Release-Engineering-Team (Radar), 10collaboration-services: upgrade contint servers to bullseye - https://phabricator.wikimedia.org/T334517#9578338 (10Dzahn)
[22:01:01] <wikibugs>	 10Continuous-Integration-Infrastructure, 10SRE, 10collaboration-services, 10vm-requests, 10Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9578337 (10Dzahn) 05Open→03In progress
[22:04:27] <James_F>	 !log Deleting deployment-db09, decommissioned 11 months ago but never deleted in T331019
[22:04:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:04:30] <stashbot>	 T331019: Edits not saving on beta cluster (db replication error, corrupted table) - https://phabricator.wikimedia.org/T331019
[22:14:05] <mutante>	 jenkins is either slower than normal or has a problem
[22:55:25] <apergos>	 db11 was read only, and I did not touch that setting. so I dunno what modified it. we'll have to check on that later.
[23:11:55] <apergos>	 db13 or a new instance, whichever, ought to be recloned, presumably from the sole replica, once we know everything is really working properly again, and put back into service, also work for tomorrow though
[23:19:25] <wmf-insecte>	 Project beta-update-databases-eqiad build #74124: 15ABORTED in 45 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/74124/
[23:27:08] <wikibugs>	 10Release-Engineering-Team (Priority Backlog 📥), 10Release Pipeline (Blubber): Do not use old version numbers on Blubber README - https://phabricator.wikimedia.org/T356908#9578598 (10dduvall) 05In progress→03Resolved
[23:58:19] <wikibugs>	 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure, 10Jenkins: Add test for PECL package (avoid releasing package with missing files) - https://phabricator.wikimedia.org/T358536#9578652 (10Krinkle)