[10:48:58] nice to see Emperor stepping up to lead our meeting
[11:01:23] They want to become manager soon 😍
[11:12:00] (SystemdUnitFailed) firing: (22) wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2197:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:05:16] btw, I made an oopsie and dropped an index when I shouldn't have in arwiki pagelinks in codfw, putting it back now
[14:22:13] See, even you make mistakes! :-D
[14:51:58] Amir1: I think that ticket has since been updated with an explanation: https://phabricator.wikimedia.org/T360597#9657593
[14:52:18] urandom: ah okay, I missed that
[14:52:23] sorry for the ping then
[14:52:27] so a kind of split-brained deployment
[14:52:38] no no, it was a good catch!
[14:54:00] /cc hnowlan to make sure that I'm interpreting those comments correctly
[14:54:37] and/or nemo-yiannis ^^^
[14:54:55] pings for everyone! \o/
[14:55:15] hey urandom
[14:56:08] heya nemo-yiannis; do I have that right, that some of the newly added nodes are/were out of step on deployment?
[14:56:28] i think they were never deployed
[14:56:35] but pooled
[14:57:03] scap never had the new targets pulled
[14:57:10] never deployed as in they weren't running the service?
[14:57:19] as in scap never ran against the new targets
[14:57:38] puppet runs scap initially
[14:58:02] but perhaps there was a race between when the targets were added and a deploy was made?
[14:58:32] i don't know the internals of how puppet calls scap
[14:58:45] either way, I guess we're saying that they were running an out-of-date restbase with respect to the rest of the cluster, yes?
[14:59:04] yes, nodes are returning different data for the same title
[14:59:21] oh sorry, not exactly what you said
[14:59:42] it could have been the right version of restbase (if the same scap config was used to bootstrap the new nodes)
[14:59:49] but they definitely serve different data
[15:04:11] What would be useful is to query cassandra on 2 nodes (new and old) for the specific failing title to see what's stored, urandom
[15:05:04] `domain: fr.wikipedia.org, title: Fichier%3ACleopatra_poster.jpg` in the page summary table
[15:05:19] nemo-yiannis: we can do that, but that would be no small failure
[15:06:06] also, there would be nothing to correlate new/old nodes with new/old data
[15:06:24] nemo-yiannis: you do have different git shas on these machines
[15:06:31] confirmed.
[15:07:04] ok (i don't have access)
[15:07:34] e5ed8d0f95671701df291f786f4c0972d2e72142 (old) vs 7e5e72087d8331131669babfb8f40b269c024cd7 (new)
[15:07:46] oddly, the older machines have an out-of-date commit
[15:08:49] is this the restbase sha, or the scap deploy one?
[15:09:07] the more recent commit has a message of "noop: Force new deployment"
[15:09:19] it would be the sha for the deployment repo, I think
[15:09:20] yeah, that's my latest commit
[15:09:25] that i deployed
[15:09:32] ok, not all of the machines are at the sha
[15:09:48] s/the sha/that sha/
[15:13:24] That shouldn't be an issue though, because there is no actual diff between those SHAs
[15:14:49] what would cause restbase to use a different url?
[15:19:12] * nemo-yiannis checking
[15:28:08] nemo-yiannis: why don't we just depool those hosts?
[15:28:27] it's an easy test
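(Editor's note: depooling the suspect hosts, as suggested at 15:28, would normally go through conftool. A minimal sketch follows, assuming the standard WMF conftool/depool tooling; the hostname is a placeholder, since the suspect restbase hosts are not named in this log.)

```
# On the suspect host itself (WMF wrapper around conftool):
sudo depool

# Or centrally, via confctl -- hostname is a placeholder:
sudo confctl select 'name=restbaseNNNN.codfw.wmnet' set/pooled=no

# Verify the change took effect:
sudo confctl select 'name=restbaseNNNN.codfw.wmnet' get
```

If the timeouts stop once only the older hosts are serving traffic, that points at the new nodes' deployment state; if they continue, the new hosts are off the hook.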
[15:31:23] we've clearly established a link between when the timeouts started, the fact that the hosts in question are running different deployments (even if they should be normatively identical), and the fact that those same hosts are exhibiting different behavior.
[15:31:57] (we should probably move this to #wikimedia-sre too)
[15:33:28] ok, let's depool them
[16:23:34] I opened https://bugs.launchpad.net/swift/+bug/2058945 and T360913 about the proxy-server failure
[16:23:35] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913
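(Editor's note: on the Swift proxy question in T360913, "no longer calling `accept`?", one way to confirm that symptom on an affected host is to check whether the proxy-server process is still issuing accept() syscalls while its listen queue backs up. A rough sketch, with the PID as a placeholder and 8080 assumed as the proxy port (the Swift default; the actual port may differ):)

```
# Is the listen socket still there, and is its accept backlog growing?
# On a listening socket, Recv-Q shows connections waiting to be accepted.
ss -ltn 'sport = :8080'

# Watch the proxy-server process for accept()/accept4() calls for ten seconds.
# No output while Recv-Q is non-zero would support the "stopped calling accept" theory.
sudo timeout 10 strace -f -p <proxy-server-pid> -e trace=accept,accept4
```

This only confirms the symptom described in the ticket; it doesn't identify the cause.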