[05:48:09] serviceops, MediaWiki-General, Performance-Team (Radar), SecTeam-Processed, Security: Create a tmp directory just for MediaWiki - https://phabricator.wikimedia.org/T179901 (Joe) Open→Declined Given we've in the meantime worked on moving to kubernetes, and of the work on shellbox, I do...
[06:08:46] serviceops, SRE, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Patch-For-Review, Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Aklapper) @Seddon: Could you elaborate on these bullet points, please? Thanks!
[14:23:06] hi all, hopefully quick question about active/passive dns discovery addresses, i.e. metafo. the wiki page states that it "will always return the IP of the endpoint in the primary data center.". this suggests that the records may be tied to `WMFMasterDatacenter`. however looking at the switch dc wikitech page it seems like the process is:
[14:23:27] pool passive site, wait for ttl; depool old active site
[14:24:01] could anyone confirm if this is correct or is there anything else to do
[14:24:37] or asked a different way, how would i move only appservers-rw to codfw without performing an entire dc switch
[14:24:52] * jbond is not planning on moving only appservers-rw, this is for a new service
[14:25:10] jbond: `WMFMasterDatacenter` is a mediawiki-ism, not anything at the traffic/discovery level
[14:26:12] for the case of a generic a/p service I think you're better off looking at e.g. `swift-rw`
[14:26:28] in which case, yeah, you pool the passive site, wait, depool old active site, on swift-rw
[14:27:16] cdanis: great thanks
[14:28:03] the value of `WMFMasterDatacenter` in etcd is *mostly* just for the database load balancer in MediaWiki itself -- it wants to look at replication lag vs the primaries
[14:28:23] there are a few other pieces of automation that happen to read it ofc
[14:29:46] ack thanks, i wasn't sure if metafo records and gdnsd had some concept of primary DC as well and were using WMFMasterDatacenter as an input. but it sounds like they don't, which is great :)
[14:32:34] jbond: "primary DC" as in "the dc which this a/p service is currently on"
[14:33:24] ack thanks :)
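For reference, the pool-then-depool dance described above looks roughly like this with confctl on a cluster management host. This is a sketch only: "example-rw" is a made-up service name, and the exact selector syntax should be checked against the DNS/Discovery wikitech page.

```
# Hedged sketch of an a/p discovery failover; "example-rw" is a
# hypothetical service name, not a real discovery record.

# 1. Pool the passive site (codfw), so both DCs are briefly pooled:
confctl --object-type discovery select 'dnsdisc=example-rw,name=codfw' set/pooled=true

# 2. Wait at least the discovery record's TTL for cached answers to expire.

# 3. Depool the old active site (eqiad):
confctl --object-type discovery select 'dnsdisc=example-rw,name=eqiad' set/pooled=false
```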
[14:53:50] keep in mind the metafo mechanism (for a/p) does have the concept of an ordered priority list, which affects how this works during a failover window.
[14:54:51] (if you ever have both pooled at the same time, the first one in the list at the gdnsd level "wins")
[14:56:04] which means if you're switching via an intermediate state of "both pooled", it will work differently going from A->B than it does going from B->A (if A is first in the list, A always wins when both are pooled, even though in one scenario we're transitioning away from A, and in the other we're transitioning *to* A)
[14:58:33] I think... now even I'm questioning that, it's hard to remember all the bits about how "failoid" is integrated too.
[15:01:30] failoid is used when nothing is pooled IIRC
[15:01:35] to make a service fail fast
[15:07:06] right
[15:09:26] yeah, so, reviewing the output config on an actual DNS server to refresh my brain
[15:10:10] both kinds (a/p and a/a) make use of a geoip map to specify per-DC addresses, e.g. "eqiad => 192.0.2.1, codfw => 192.0.2.2"
[15:10:40] yes, i got hit by failoid when i deployed. i was thinking steps 2 and 3 in https://wikitech.wikimedia.org/wiki/DNS/Discovery#Add_a_service_to_production should be swapped, i.e. pool the service before updating dns to avoid a short period of failoid
[15:11:00] for the a/a cases, that's all there is. If both are pooled the traffic splits geographically, and if one is depooled everything goes to the other. If both were depooled, it's basically going to go back to acting as if both are pooled (there isn't a no-service state).
[15:11:40] for the a/p cases, there's a metafo resource enclosing the geoip resource, which lists the geoip resource as the first preference and the failoid (intentionally-dead fake service) as the fallback.
[15:13:00] so for those cases: if 1/2 is pooled it gets all traffic, if 2/2 are pooled they geo-split (which is no longer a/p, and might not be a desirable state for most a/p services!), and critically different from the a/a case: if both are depooled, traffic gets shunted off to the failoid IPs instead of to one of the two depooled service IPs.
[15:13:00] thanks for the explanation, this is really useful
[15:14:00] for my use case i think it's fine to have short periods of a/a while the service is failed over
[15:14:09] so for the a/p cases, I think the desired state transition from any A to B is to depool-before-pool. Otherwise you will create a temporary a/a situation for the service, which it might not be able to handle.
[15:14:29] (instead, you get a temporary "all traffic goes to the failoid IP" state)
[15:14:54] well i guess it's a trade-off: either go a/a for a bit or drop all traffic. i guess it depends on the service
[15:15:24] yeah, but if the service can handle a window of a/a traffic without causing data problems, etc... why isn't it just a/a all the time? :)
[15:16:04] I tend to assume the reason for the a/p decision is that a/a causes problems
[15:16:31] but yeah, it's a decision that *can* be made on a case-by-case basis
[15:17:19] well in my use case of apt, i want to be able to switch services over to do work. when i do this i can: stop any new uploads; sync data; go a/a; go a/p (swapping the a); then enable uploads again. similar to MW going RO for a dc switch i guess
[15:17:38] yeah, I guess that makes sense
[15:18:10] during the a/a period, both will be live with a geo-split, like a normal a/a service.
[15:18:39] yes, in this case it's fine. the only reason they can't go a/a permanently is we don't sync the data often enough
[15:18:40] (so ignore what I said much earlier above about A->B being different than B->A, that was just me mis-remembering how the low-level bits were laid out)
[15:18:49] ack
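To make the shape of that concrete, here is a very rough sketch of how the generated gdnsd plugin config for one a/p service could be laid out. Everything here is invented for illustration (resource names, addresses, and the exact vscf details), and the real config is generated from templates, so treat this as a sketch of the mechanism only.

```
# Rough illustration only -- names and addresses are invented, and the
# real config is generated, not hand-written like this.
plugins => {
  geoip => {
    maps => { generic-map => { datacenters => [eqiad, codfw] } }  # (map details omitted)
    resources => {
      # per-DC addresses for the service, as described above
      disc-example-rw => { map => generic-map, dcmap => { eqiad => 192.0.2.1, codfw => 192.0.2.2 } }
    }
  }
  metafo => {
    resources => {
      # a/p only: the geoip resource is the first preference; failoid is
      # the intentionally-dead fallback that wins only when both DCs are
      # depooled, so clients fail fast instead of timing out
      example-rw => {
        datacenters => [real, failoid]
        dcmap => {
          real => %geoip!disc-example-rw
          failoid => 192.0.2.66
        }
      }
    }
  }
}
```

With that layout, a/a services stop at the geoip layer, while a/p services get the metafo wrapper and hence the failoid behaviour described above.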
[15:35:54] https://toolhub.wikimedia.org is alive! Many thanks to everyone who helped this happen. See the announce email for more info: https://w.wiki/4DXz
[15:40:19] congrats!
[15:46:58] \o/
[15:52:23] bd808: don't spoil my street cred on slack
[15:52:25] :P
[15:52:42] * bd808 hides from the mean, mean joe
[15:55:19] huh?
[15:56:50] majavah: I sent joe a :hug: emoji on slack in response to him trying to avoid the wikilove I gave him and others for helping in the Toolhub project.
[15:57:39] we are just poking each other because that's how crabby old guys bond
[15:57:43] majavah: I have a street cred to protect. If people realize I'm actually helpful, I won't have anywhere to hide from requests
[16:21:56] serviceops: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (jijiki)
[16:43:13] What is the default job concurrency for high_traffic_jobs_config?
[16:44:10] And as long as 4 is higher than the default I wonder if someone could deploy this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/730846 to avoid our backlog growing
[16:44:37] I'd also love a pointer toward how to get access to deploy such things
[16:49:52] serviceops, WMF-JobQueue, Patch-For-Review: Several jobs (incl. recentChangesUpdate, wikibase-InjectRCRecords) accumulating backlog since 2021-10-14 14:47 UTC - https://phabricator.wikimedia.org/T293385 (Addshore)
[16:52:51] serviceops, WMF-JobQueue, Patch-For-Review: Several jobs (incl. recentChangesUpdate, wikibase-InjectRCRecords) accumulating backlog since 2021-10-14 14:47 UTC - https://phabricator.wikimedia.org/T293385 (Addshore) p:High→Unbreak! Given the backlog is 1.4 hours and this has a user facing impac...
[16:57:58] serviceops, WMF-JobQueue, Patch-For-Review: Several jobs (incl. recentChangesUpdate, wikibase-InjectRCRecords) accumulating backlog since 2021-10-14 14:47 UTC - https://phabricator.wikimedia.org/T293385 (Addshore) It looks like some other jobs are also affected and building up {F34688885}
[17:01:17] addshore: it's not documented, but based on the source it seems DEFAULT_CONCURRENCY is 30. Although that said, I don't think we have any jobs that don't have an explicit concurrency in the config
[17:02:02] interesting! this one doesn't seem to: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/730846/2/helmfile.d/services/changeprop-jobqueue/values.yaml
[17:02:04] it is new though
[17:03:26] I see 50 in the chart (I think https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/b5b5c2751236653b581062eeae314681542bebbf/charts/changeprop/values.yaml#31 )
[17:04:18] oops, you're totally right
[17:04:35] I wasn't sure if that was overridden where that chart is actually deployed though
[17:04:55] So an "increase" of concurrency to 4 is actually a decrease?
[17:06:34] seems like it's the same on the live config
[17:06:40] and yes, aiui it would be a decrease
[17:07:40] as regards how to deploy the change, jobqueue uses the k8s deployments process https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments which I believe wikidev users can do (could be wrong on that one though)
[17:08:07] So it seems like a concurrency increase here might not actually help us clear this backlog
[17:09:16] But this also adds the job to `high_traffic_jobs_config`, which I guess does something else too?
[17:11:13] "All the jobs listed below get their own rule, which transfers to their own processing unit" and a separate worker.
[17:11:36] * addshore reads the deploy page now
[17:11:37] afaict that job doesn't currently exist at all
[17:11:49] It's been running for a week or so! :P
[17:11:58] wait, in fact, this job has existed for years I believe
[17:12:35] Data in grafana since may for it, indeed
[17:12:49] oh, is it under another name?
[17:13:00] In grafana it is as "wikibase-InjectRCRecords"
[17:14:19] ah, oops :)
[17:14:27] it's caught by the mediawiki.job.*
[17:14:43] so it would be under low_traffic_jobs currently
[17:15:11] Does "injected less than 10 times per second" count as low traffic?
[17:15:58] If not then I think let's merge and deploy this change as is! :) (I think)
[17:16:10] I think it would be fair to put it into high traffic - in low_traffic_jobs aiui the concurrency of 50 applies to *all* of the jobs matched rather than each individual job matched by it
[17:16:27] Right, it's all starting to make sense!
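For the record, the change under discussion is roughly the following shape. This is paraphrased with approximate key nesting, not copied from the patch; see the gerrit change linked above for the real diff.

```yaml
# Sketch of a high_traffic_jobs_config entry in the changeprop-jobqueue
# helmfile values; key nesting is approximate, not the verbatim patch.
jobqueue:
  high_traffic_jobs_config:
    # listing a job here gives it its own changeprop rule and worker,
    # with its own concurrency, instead of sharing low_traffic_jobs'
    # pooled concurrency of 50 across all matched jobs
    wikibase-InjectRCRecords:
      enabled: true
      concurrency: 4
```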
[17:16:53] so I don't think concurrency 4 is going to get *worse* performance :)
[17:17:30] I can merge and deploy if you'd like - do you have a test to verify improvements in performance once that's done?
[17:17:39] That would be great!
[17:17:51] I'll be around watching the grafana dashboard after deploy to see how it works
[17:19:40] Right, seemingly I might have the technical ability to do this
[17:20:36] at least I can do the helmfile diff against eqiad for example :)
[17:24:34] addshore: you have deployment access to mediawiki right?
[17:24:40] yup
[17:25:15] so you can deploy to kubernetes :)
[17:25:33] Well, Today I Learned!
[17:26:03] addshore: would you like to try deploying in that case? :) I'm seeing the change in the diff
[17:26:13] I'm happy to go ahead and do it if you'd prefer not to
[17:26:18] potentially https://wikitech.wikimedia.org/wiki/Changeprop#To_Kubernetes is slightly out of date then? or is this talking about something else?
[17:26:22] hnowlan: I'd love to!
[17:27:00] hmm, i should be in `addshore@deploy1002:/srv/deployment-charts/helmfile.d/services/changeprop`, right?
[17:27:14] no! changeprop-jobqueue?
[17:27:16] addshore: it is out of date, I will update that
[17:27:23] addshore: and yep, changeprop-jobqueue
[17:27:58] diffs look good, and I guess I need to apply to eqiad and codfw? what about staging?
[17:28:10] I'd do staging, then codfw, then eqiad
[17:28:16] sounds good!
[17:28:22] should I be logging this anywhere?
[17:28:25] but for a change like this I wouldn't be too worried
[17:28:37] the helmfile apply will get logged in -operations
[17:28:42] gotcha!
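For anyone following along, the deploy steps as described boil down to roughly this on the deployment host (per the Kubernetes/Deployments wikitech page linked above; treat it as a sketch of the workflow rather than a verbatim session):

```
# On deploy1002, per https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments
cd /srv/deployment-charts/helmfile.d/services/changeprop-jobqueue

# review the pending change in each environment before applying
helmfile -e staging diff

# roll out staging first, then codfw, then eqiad; each apply gets
# logged to -operations automatically
helmfile -e staging apply
helmfile -e codfw apply
helmfile -e eqiad apply
```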
[17:35:00] all done, now to stare at the board
[17:35:28] serviceops, WMF-JobQueue, Patch-For-Review, User-Addshore: Several jobs (incl. recentChangesUpdate, wikibase-InjectRCRecords) accumulating backlog since 2021-10-14 14:47 UTC - https://phabricator.wikimedia.org/T293385 (Addshore) p:Unbreak!→High a:Addshore Deployed, and now to stare at...
[17:37:15] Am I correct in thinking the jobs that were already in the low traffic queue will remain there and get executed by this shared low traffic pool of CPU time?
[17:37:43] but all new jobs will go straight into this faster dedicated thing?
[17:49:09] shall we up the concurrency? :P
[17:49:15] bah, wrong channel...
[18:09:21] serviceops, WMF-JobQueue, Patch-For-Review, User-Addshore: Several jobs (incl. recentChangesUpdate, wikibase-InjectRCRecords) accumulating backlog since 2021-10-14 14:47 UTC - https://phabricator.wikimedia.org/T293385 (Addshore) Open→Resolved {F34688959}
[18:45:04] serviceops, MediaWiki-General, SRE, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 5 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Urbanecm) Resolved→Open >>! In T219279#7403482, @Pchelolo wrote:...
[22:07:24] serviceops, Patch-For-Review, Performance-Team (Radar), Sustainability: Remove mod_unique_id from app servers - https://phabricator.wikimedia.org/T253675 (Legoktm) I couldn't figure out exactly what was enabling mod_unique_id in the first place. On mw1320, I see: ` legoktm@mw1320:/etc/apache2$ ls...
[22:46:27] serviceops, Patch-For-Review, Performance-Team (Radar), Sustainability: Remove mod_unique_id from app servers - https://phabricator.wikimedia.org/T253675 (Dzahn) >>! In T253675#7430025, @Legoktm wrote: > I couldn't figure out exactly what was enabling mod_unique_id in the first place. @Legoktm I...
[22:46:57] legoktm: ^ installing libapache2-mod-security2 also pulls in the unique_id mod
[22:47:01] ahhhh
[22:52:02] mutante: thanks for figuring that out :)
[22:52:57] yw, so.. this was originally from https://gerrit.wikimedia.org/r/c/operations/puppet/+/467643
[22:53:17] but now not sure if it was also in the old apache module setup
[22:54:47] serviceops, Patch-For-Review, Performance-Team (Radar), Sustainability: Remove mod_unique_id from app servers - https://phabricator.wikimedia.org/T253675 (Legoktm) Oooh, I forgot to look for that, thanks! I think that would necessitate declining this task, unless mod_security2 is also not really...
[22:56:47] legoktm: https://phabricator.wikimedia.org/rOPUP646adf13b514cd4f380055a3bee9a7f4955310a3 -> https://phabricator.wikimedia.org/T132599
[22:57:05] Ori made it :) "give us the ability to ban cache objects that were generated by codfw app servers, in case codfw app servers produce mangled responses after the switch over" hmmmmmm
[22:57:51] I didn't realize https://httpd.apache.org/docs/2.4/mod/core.html#servertokens was part of httpd core
[22:58:57] I wonder why he picked overloading the Server: header rather than introducing a new one
[23:00:11] hmmm, "(beyond the limited set of choices provided by the ServerTokens directive)."
[23:00:48] yeah, I linked it above
[23:01:16] I guess Apache forces "Server:" to be Apache/2... unless you use mod_security to override it
[23:02:25] also I think this is just low priority cleanup, not something that needs actively fixing
[23:03:15] "If by any chance you want to remove or change this header even from the response of the reverse proxy, you will have to use mod_security,"
[23:03:24] ack, sounds like it
[23:04:24] cleaning up mod_security2/Server: $fqdn might be useful on its own, but I don't think "Remove mod_unique_id" should be the motivation behind it
[23:05:25] makes sense, the ticket just says it causes confusion but not that it's an actual performance problem, *nod
[23:22:49] serviceops, Performance-Team (Radar), Sustainability: Remove mod_unique_id from app servers - https://phabricator.wikimedia.org/T253675 (Krinkle) It seems that is indeed the only use, and there's a couple of workarounds and bugs around this module that may be possible to remove if it too were removed....
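For context on the Server:-header trick being discussed there, the mechanism looks roughly like the following. This is an illustrative sketch, not the actual puppet-managed config; the hostname is just an example taken from the ticket.

```
# Illustrative sketch only -- not the actual puppet-managed config.
# ServerTokens alone can only choose among Apache's fixed banner
# formats; per the mod_security docs it must be set to Full so that
# SecServerSignature can overwrite the whole string:
ServerTokens Full

<IfModule security2_module>
    # replace the Server: header with an arbitrary value, e.g. the
    # backend's FQDN (example hostname):
    SecServerSignature "mw1320.eqiad.wmnet"
</IfModule>
```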