[07:39:20] FYI: netbox will be briefly unavailable in ~5m due to a reboot
[07:56:38] Netbox is back
[08:39:56] effie: ready for the next round? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/918388
[08:41:59] heads-up: cumin2002 will be rebooted in 20m
[08:52:32] I'm seeking a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/918387 from kind souls
[08:53:36] godog: why do they need to be hardcoded?
[08:54:54] volans: historical reasons
[08:55:15] still valid historical reasons?
[08:56:05] maybe, I haven't checked
[08:56:32] godog: sorry, can't +1, gerrit migration in progress
[08:56:57] hah, excellent timing, thank you anyways volans
[08:57:07] I'll do it when it's back
[08:57:16] you can't deploy it either anyway :D
[08:57:41] heheh true, or look at the patch
[09:06:58] duesen: I think we can go ahead
[09:09:55] effie: nice! Right now?
[09:10:42] go for it, I will open my graphs in the meantime
[09:10:51] let me +1 you
[09:13:33] cumin2002 is back
[09:19:24] effie: waiting for your +1
[09:22:07] sorry, I got distracted
[09:29:55] effie: scap is doing a *massive* amount of git fetches. I have not seen that before.
[09:30:10] hmm
[09:30:16] Seems to be running ok though
[09:30:45] All the submodules, I guess?
[09:35:49] I scap so rarely that anything I'd say would be a wild guess
[09:37:05] fpm restarts ongoing
[09:42:20] effie: scap complete
[09:43:06] effie: I'm seeing a bump in the job insertion rate
[09:45:20] the increase in parsoid jobs is clearly visible as well
[09:46:08] seeing about 30 jobs/sec, up from 2.5
[09:47:17] processing rate is down a bit
[09:48:07] effie: does this look ok?
[09:49:42] let's let the dust settle a bit
[09:50:55] monitor for a while and see how things progress. Our first concern is whether any alarms go off; the second is how much we are slowing down job processing, which we can assess soon, but not yet
[09:54:38] ok. I can observe changes, but I can't really tell whether or not we should be alarmed by them :)
[09:58:27] going for lunch
[10:01:04] cheers, will babysit the dashboards
[10:44:33] Morning all. FYI I have been asked to create a temporary user account for an external organisation to access some of the DE suite of tools: https://phabricator.wikimedia.org/T336357
[10:45:44] It's definitely a legit request, but I thought I'd mention it in case anyone sees it in passing and has any queries and/or guidance.
[10:46:23] I'm happy to do the work myself, but it's out of the ordinary so I just thought I'd flag it here.
[11:22:42] alertmanager doesn't seem to respect profile::monitoring::notifications_enabled; should I file a task?
[11:24:17] the alert is SystemdUnitFailed (not like a remote job), so I think it shouldn't fire
[11:24:39] but I don't know much about how those are implemented
[11:26:58] <_joe_> ok, interestingly, if I do this by hand with my account, it works
[11:27:11] <_joe_> it doesn't work for sirenbot though
[11:27:19] <_joe_> I need to try with its account - later
[11:52:44] dcaro: can you let me know what happened with the reduction in priority of wm_enterprise_downloader.py a few days ago? I don't see it still running, but it also doesn't have the enwiki files it was in the process of writing at the time, as though maybe they were cleaned up for some reason.
[11:53:28] apergos: we just disabled the paging for the alerts, and let it get as many resources as it needs from the host. So no changes in priority.
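(A side note on the exchange above: if anyone wants to confirm that the downloader's scheduling priority really was left untouched, a minimal sketch with standard tools follows. The script name is taken from the log; the assumption that the oldest match is the right process is mine.)

```
# Find the downloader process (script name from the log above)
pgrep -af wm_enterprise_downloader.py

# Show the nice value (NI), CPU/memory share and runtime for the
# oldest matching PID; NI 0 would mean the priority was never lowered
ps -o pid,ni,pcpu,pmem,etime,cmd -p "$(pgrep -of wm_enterprise_downloader.py)"
```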
[11:53:44] ok interesting, thanks
[11:54:38] if you see that causing issues, then we might want to look more at using the resources better, but letting it use all the resources it has available seemed like a good thing xd (as long as the impact on other things was not huge, and it did not look like it was)
[11:55:12] I'm about to restart it because the backfill is not complete
[11:55:56] it will be running on clouddumps1001 in a screen session as the dumpsgen user
[11:56:13] :+1
[11:57:19] (restarted)
[12:01:55] effie: how is it looking? I'm about to get in the car and drive to another city...
[12:12:53] from what I can see on https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue, insertion rate is up, processing rate is down a little, but cpu is not up - it's lower, if anything. This would be consistent with adding a bunch of i/o-heavy jobs. I would have expected parsing to be cpu-heavy...
[12:39:41] duesen: so far so good; if anything, we will just revert the patch
[12:40:19] let's not forget that jobs are not uniform, they have all sorts of durations, intensities, etc.
[13:21:57] <_joe_> effie, duesen: why are the parsoid jobs being inserted in codfw?
[13:23:25] <_joe_> almost at the same rate as in eqiad
[13:23:38] <_joe_> edits should only happen in eqiad
[13:23:54] <_joe_> so what other type of event causes this job to be created?
[13:31:31] it is half the rate, right? I am not sure, to be honest
[13:46:28] <_joe_> !issync
[13:46:28] Syncing #wikimedia-sre (requested by joe_oblivian)
[13:46:30] Set /cs flags #wikimedia-sre sirenbot +Aitv
[13:46:38] first of all, the jobrunners on codfw are idle; I see the jobs in eventgate and jobqueue
[13:47:07] https://w.wiki/6gj$
[13:47:27] <_joe_> effie: yes, something is inserting these jobs in codfw
[13:47:33] <_joe_> not executing
[13:48:35] yes, it wasn't obvious at first
[13:54:35] <_joe_> sorry, testing something :)
[13:56:17] _joe_: I will look into it with daniel
[13:58:32] <_joe_> oh, yeah
[13:58:34] <_joe_> it works
[14:23:34] jbond: thanks for the sel cookbook!
[14:25:05] np :)
[15:29:26] _joe_, effie: these jobs are created on edit, and on page views if the page's content isn't found in the main parser cache.
[15:29:41] <_joe_> duesen: ok, that explains it, thanks
[15:29:41] ...sorry for the late response, I was driving and got stuck in traffic
[15:29:53] <_joe_> duesen: traffic in germany? fake news!
[15:30:05] _joe_: so, how to fix?
[15:30:40] <_joe_> duesen: there's nothing to fix
[15:30:46] <_joe_> the jobs are inserted in codfw
[15:30:49] (The Autobahn is like our high-speed trains - great when they actually work, but everything is under construction all the time, causing chaos)
[15:31:15] <_joe_> changeprop picks up the jobs in codfw, then submits them to jobrunner.discovery.wmnet
[15:31:20] <_joe_> which resolves to eqiad
[15:31:21] _joe_: as long as they are still executed, it's all good :)
[15:31:30] <_joe_> yes they are, in the right DC, too
[15:31:43] Excellent. Thanks for checking!
[15:32:14] <_joe_> I was just expecting the job to be generated only on edit
[15:32:24] <_joe_> so I was like "wait, are we allowing edits in codfw?"
[15:34:29] <_joe_> thanks legoktm for writing the patch to make sirenbot change the topic without getting opped
[15:35:12] ah, is that fixed? cool!
[15:43:10] woot :D
[15:43:55] Using `/cs topic ...` magic, I would imagine?
[15:44:19] * bd808 sees that CS did the change, so yes
[15:54:56] <_joe_> bd808: yeah
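(For reference, _joe_'s point above about jobrunner.discovery.wmnet resolving to the active DC can be spot-checked from any prod host with dig. A minimal sketch; the per-DC record names are an assumption based on the usual <service>.svc.<site>.wmnet convention, not something confirmed in the log.)

```
# The discovery record should point at the currently active DC (eqiad here)
dig +short jobrunner.discovery.wmnet

# Per-DC service records to compare against (assumed names)
dig +short jobrunner.svc.eqiad.wmnet
dig +short jobrunner.svc.codfw.wmnet
```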
[16:10:38] _joe_: for context: the main parser cache is updated synchronously on edit, and synchronously when viewing a page with a stale cache entry. The idea was to populate/update the parsoid parser cache in the same situations, but async.
[16:10:54] <_joe_> duesen: fair
[16:11:17] <_joe_> and we only invalidate it on transclusions, not re-parse it?
[16:28:39] _joe_: yes, just like the main PC
[17:09:24] root@gerrit1001:/# ethtool eno2 | grep Speed
[17:09:24] Speed: Unknown!
[17:09:37] ^ wonder why it's "Unknown!"
[17:09:50] how do I actually check link speed?
[17:10:10] ethtool .. | grep Speed seemed to be the thing
[17:12:15] ip link show ... has a "1000" at the end of the line
[17:14:27] mutante: link is down?
[17:15:05] sukhe: no, it's up, or gerrit would be dead now
[17:15:09] eno1:
[17:15:31] but you are checking eno2 in the previous message
[17:15:32] oh, yes, eno2
[17:16:03] well, duh :) that was too obvious now. thanks sukhe
[17:16:10] np, happens :)
[17:16:17] gigabit confirmed
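(The takeaway from the exchange above, as a sketch: ethtool reports "Speed: Unknown!" when the queried port has no link, so first check which interface is actually up. Interface names below are the ones from the log; ethtool and ip are standard Linux tools, run as root as in the log.)

```
# Confirm which interface carries the link (look for UP / LOWER_UP)
ip -br link show

# Then query that interface; on the downed port (eno2 here)
# ethtool prints "Speed: Unknown!" instead of a speed
ethtool eno1 | grep -E 'Speed|Link detected'
```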
[17:27:35] echoing here, as in the -ops channel it might get lost in the higher traffic:
[17:28:06] the "Uncommitted DNS changes in Netbox on netbox1002" alert is due to the addition of the DNS name on this IP: https://netbox.wikimedia.org/ipam/ip-addresses/12908/
[17:28:49] and to be propagated it requires the sre.dns.netbox cookbook to be run, but the description says manual DNS. In that case you need to empty the DNS Name field.
[17:29:48] cc arturo (who might be offline by now); not sure who from wmcs might have context on it
[17:30:34] btullis: cwhite: herron: hnowlan: marostegui: ottomata: if you get a sec, could you clean out your home directory on apt1001?
[17:30:37] andrewbogott, taavi: ^
[17:33:08] jbond: sure, done
[17:33:12] done
[17:33:28] thanks <3
[17:35:06] volans: I think we eventually want it auto-generated from netbox, although not quite yet. The zone files for wikimediacloud.org (or the reverse zones) don't currently reference the netbox zones iirc, so both options work, I guess. I don't have access to do either.
[17:45:08] volans, RhinosF1: that's something arturo and papaul are working on
[17:47:16] jbond: done. Sorry about that.
[17:47:39] btullis: thanks, and no worries
[18:51:10] taavi, andrewbogott: I'm removing the DNS Name for now to unblock other changes, treating the Description as the source of truth
[18:52:55] volans: yes, I meant the reimage. let's see what mori.tzm thinks
[18:53:16] * jbond sorry, wrong room
[18:54:08] arturo, papaul ^^^
[18:59:54] to be clear, that keeps the status quo, as the record was never pushed to the authdns servers
[19:17:23] jbond: done
[19:51:56] volans: sorry, yes, it can be removed if required
[19:52:10] I'm not at the laptop right now
[19:52:25] forgot to run the cookbook
[19:52:42] but also it's not important, it can be done at another time
[19:52:51] cc andrewbogott
[19:53:36] I've already removed the record from netbox, so nothing to do there if it's handled manually
[19:53:39] for now
[20:49:07] I just realized that our wdqs hosts are using the 'powersave' cpu scaling governor. What's everyone else using? I'd like to change it to 'performance' unless there's a good reason not to
[20:50:25] inflatador: I'm pretty sure we've had this issue on various other hosts before, and it's been changed
[20:50:31] inflatador: see T225713 and T315398 for some related background
[20:50:32] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713
[20:50:32] T315398: Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398
[20:50:46] heh, volans has ze links
[20:50:52] (and feel free to search for similar ones in phab)
[20:52:21] https://phabricator.wikimedia.org/T328957 though, depending on the actual machines, might cause power draw issues :)
[20:53:56] yeah, check with dcops too
[20:56:39] damn, y'all write good tickets ;) ! Something to aspire to
[20:56:52] we all write crappy ones too :D
[21:02:21] True enough, but these are nice. Sadly, the hosts in question are indeed R450s
[21:04:03] That task is about caching sites, and I'm guessing the wdqs hosts are in eqiad/codfw, so probably less of an issue (maybe, hopefully)
[21:05:00] Yeah, definitely food for thought. I'll talk it over with my team and dc ops and see if they have opinions. Thanks for the links!
[21:06:33] I also noticed that our CPU frequency governors are set to 'powersave', which I think is sub-optimal. SREs gave us some links for reference: https://phabricator.wikimedia.org/T315398 https://phabricator.wikimedia.org/T328957 https://phabricator.wikimedia.org/T225713
[21:07:53] FWIW I don't think this is the key to our problem; the old chassis are using 'powersave' as well
[21:12:12] oops, meant to post all this in the search room, ignore
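(To close the loop on the governor discussion: a minimal sketch of how one might inspect and temporarily flip the governor on such a host, assuming the standard Linux cpufreq sysfs interface. A persistent, fleet-wide change would presumably be done via puppet, as in the tasks linked above.)

```
# Show the current governor on every CPU (usually all identical)
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# List what the driver supports (intel_pstate typically offers
# just 'performance' and 'powersave')
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# Switch to 'performance' until the next reboot; tee is used because
# a plain 'sudo echo ... >' redirect would run as the calling user
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$g" >/dev/null
done
```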