[05:39:26] FIRING: SystemdUnitFailed: prometheus-mysqld-exporter.service on db1258:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:41:26] ^ I am setting that one up
[06:43:28] Amir1: Let me know when around, I am ready to push x3
[09:15:40] marostegui: I'm slowly waking up, if you want to start though, I'm around
[09:15:51] for now x3 is ignored right?
[09:16:41] yeah
[09:16:51] ok good
[09:16:55] So I am going to prepare then
[09:19:11] Amir1: Regarding the x3 masters, will any write happen?
[09:19:18] Because I need to define the masters for the section
[09:19:29] Should I put the current s8 ones or can I add the x3 ones? (they are RO of course)
[09:23:11] marostegui: in dbctl it should be the s8 master
[09:23:19] cool
[09:23:21] doing it
[09:23:26] once I do the switch, it'll start both read and write
[09:24:30] Amir1: I am about to push the eqiad config
[09:24:55] it should be a noop
[09:25:01] I'll check the logs though
[09:25:04] Can't wait to bring the site down
[09:25:47] xD
[09:27:40] of course some regex is missing in dbctl
[09:27:40] XD
[09:27:42] checking
[09:27:56] I told you :P
[09:27:59] I hate that thing
[09:28:27] (passionately)
[09:29:36] ah, x1 is treated as externalloads
[09:29:41] but sX and x3 shouldn't
[09:29:58] x3 still should be in externalloads
[09:30:06] what?
[09:30:27] so flavour: external then?
[09:30:32] yeah
[09:30:46] and mw is changed to handle it
[09:30:50] ok that's easier
[09:31:49] Amir1: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145834/
[09:38:18] Amir1: I am ready to push eqiad
[09:38:28] let's goooo
[09:38:54] interesting
[09:39:00] Let me check something first
[09:39:32] It looks like hosts can be in two sections, but not in the specific groups (api, vslow)
[09:39:42] Either that or the diff isn't showing it
[09:40:11] Anyway, going to push and then we can see
[09:40:21] externalloads can't have groups
[09:40:28] that I learned yesterday
[09:40:37] awesome XD
[09:40:41] Pushed
[09:40:43] only eqiad for now
[09:41:14] Let me know if it looks good on your end before I start with codfw
[09:41:25] logstash looks clean
[09:41:33] ok, let me configure codfw then
[09:42:01] https://noc.wikimedia.org/dbconfig/eqiad.json looks good too
[09:46:52] Amir1: I am ready to push codfw x3
[09:47:24] I'm checking something
[09:47:50] ok
[09:47:53] I am holding for now
[09:49:22] ladsgroup@mwdebug1002:~$ mwscript eval.php --wiki=fawikiquote
[09:49:22] > var_dump( \MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory()->getExternalLB( 'extension3')->getConnection( DB_REPLICA, null,'wikidatawiki')->newSelectQueryBuilder()->select( 'max(rev_id)')->from( 'revision')->limit( 1 )->fetchField() );
[09:49:22] string(10) "2348443449"
[09:49:29] it works
[09:49:44] so I can go for codfw?
[09:50:15] yup
[09:50:20] ok, pushing
[09:50:34] committed
[09:51:38] \o/
[09:52:22] I am leaving this https://phabricator.wikimedia.org/T390530 open as we still have to disconnect everything and make the "masters" real masters
[09:52:32] And also db1258 needs to go into production and db2244 is still waiting to be racked
[09:54:57] Amir1: https://gerrit.wikimedia.org/r/c/operations/dns/+/1145841
[10:00:22] this should switch mw: https://gerrit.wikimedia.org/r/1145844
[10:00:47] Let's go then?
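
(For reference: a minimal sketch of the usual "review, commit, verify" flow for a dbctl push like the one above. The commit flag, the message text, and the jq path are assumptions about the standard workflow, not a transcript of the exact commands used in this session; the JSON key name is assumed to mirror MediaWiki's externalLoads.)

    # Review the generated dbconfig change before publishing it
    dbctl config diff
    # Commit/publish the change (message is illustrative only)
    dbctl config commit -m "Add x3 (flavour: external) to eqiad"
    # Confirm the published config now exposes the new external cluster
    curl -s https://noc.wikimedia.org/dbconfig/eqiad.json | jq '.externalLoads | keys'
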
[10:01:14] the scariest part is the masters, if we mess it up, the whole cluster goes kaboom
[10:01:20] I double checked it
[10:01:41] can you test in mwdebug first?
[10:02:01] yeah
[10:02:06] but someone is deploying right now
[10:02:36] ok
[10:48:43] s5 decreased a bit on eqiad, -0.2% FYI
[11:51:35] marostegui: show processlist on db2243 now shows only term store queries (`Wikibase\Lib\Store\Sql\Terms...`); on top of that, InnoDB buffer pool efficiency is going up https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-1h&to=now&timezone=utc&var-job=$__all&var-server=db2243&var-port=9104&refresh=1m&viewPanel=panel-13
[12:54:15] btullis: This has been alerting for a few days, since last week I believe:
[12:54:16] [14:26:56] FIRING: [3x] PrometheusMysqldExporterFailed: Prometheus-mysqld-exporter failed (dbstore1009:13350) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[12:54:19] could you take a look?
[13:18:42] marostegui: Looking now. Apologies for the delay.
[13:24:04] No worries!
[13:24:45] urandom: FWIW, restbase2026 is alerting for disk space
[13:31:43] Emperor: thanks
[13:34:13] marostegui: It seems that it is related to this: T371049 - I'm seeing prometheus grant errors like this: SELECT command denied to user 'prometheus'@'localhost' for table `heartbeat`.`heartbeat`
[13:34:13] T371049: prometheus-mysqld-exporter doesn't fully support multi-instances for pt-heartbeat - https://phabricator.wikimedia.org/T371049
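
(For reference: a rough sketch of how a T371049-style grant error on a multi-instance host can be inspected from the command line. The socket path and instance name are placeholders, and the GRANT is only the direct fix for a missing privilege; per the task, the underlying gap may be the exporter's multi-instance pt-heartbeat support rather than the grants themselves.)

    # Check what the exporter user can actually read on the affected instance
    sudo mysql -S /run/mysqld/mysqld.s5.sock -e "SHOW GRANTS FOR 'prometheus'@'localhost'"
    # If SELECT on the heartbeat table is simply missing, and local policy allows it:
    sudo mysql -S /run/mysqld/mysqld.s5.sock -e "GRANT SELECT ON heartbeat.heartbeat TO 'prometheus'@'localhost'"
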