[06:37:30] API exceptions and fatals seem to be going off a lot. Seemingly timed well with mw pre sync promoting testwikis
[06:39:03] FYI dancy jeena as possibly train related but I also could be completely off
[06:45:06] Taking a look
[06:52:07] Looks like it started a couple hours after the sync to testwikis. I can roll back though if needed
[07:02:11] jeena: first alert I can see was 15 minutes after
[07:02:41] It needs looking at to see why it's going off, but I can't say any more because I only have what icinga and log bot say in #wikimedia-operations
[07:04:06] 04:38:00 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.1  refs T314190 (duration: 35m 37s)
[07:04:06] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190
[07:04:18] 04:50:05 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 140 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:04:58] looking into those atm. seems indeed related to the train, but it's a serialization issue so not limited to testwikis
[07:07:19] Thanks taavi
[07:07:30] Is rollback the best approach then?
[07:07:42] And is there a task to make UBN?
[07:09:29] T317606
[07:09:29] T317606: PHP Notice: Undefined index: asOfTime - https://phabricator.wikimedia.org/T317606
[07:09:34] and yes, I'm in favor of a rollback
[07:09:35] this is the train task https://phabricator.wikimedia.org/T314190
[07:09:46] jeena: can we rollback then?
[07:09:51] taavi: thanks for looking :)
[07:09:59] yes I can do that
[07:10:18] * RhinosF1 is slightly concerned we went 4 hours with no one noticing but that is also a risk of automated sync
[07:10:46] 3-3.5 maybe
[07:17:58] according to the task the errors are also happening in wmf.28
[07:19:27] anyway rollback has finished
[07:21:08] error rate looks better
[07:25:31] well maybe I spoke too soon
[07:26:17] jeena: taavi said it's a serialisation error
[07:26:21] So it will happen on both
[07:26:27] I'd expect errors to drop soon
[07:26:38] Once the corrupt .1 entries have cleared
[07:26:40] yeah that's what I thought initially, but then the error rate went down
[07:26:44] ah ok
[07:27:07] taavi: do we know how long they are cached for?
[07:27:15] And can they be marked bad / purged?
[07:28:31] 307 Amir1
[07:29:38] jeena: I'm not sure rollback is a good idea. This is a serialization mismatch
[07:29:53] and doesn't break anything user-facing
[07:29:57] unfortunately I already rolled back
[07:30:08] sorry
[07:30:08] I guess it's late there
[07:31:17] or do you mean I should promote to testwikis again
[07:31:36] ah yeah it's midnight :P
[07:31:42] no, let's leave it as is, I'll try to backport the serialization change later today
[07:31:49] okay, thanks Amir1
[07:33:11] and there is no need to purge anything, the whole concept of error budget is useful in such cases. Especially given that it doesn't have any user-facing impact (the whole CP is mostly useless if you ask me but that's for the future)
[07:34:05] Wasn't sure on actua
[07:34:09] Actual impact
[07:34:12] That's good to know
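The "Undefined index: asOfTime" notices are what you get when one deploy version reads cache entries written by the other. The serialization change that was backported is not shown here; purely as an illustration of the general pattern (MediaWiki itself is PHP, and the schema key, function name and defaults below are invented for the example), a version-tolerant cache read might look like this:

```python
import json
import time

# Hypothetical sketch only, not the MediaWiki code behind T317606: when two
# deploy versions share a cache, entries written by one may lack fields the
# other expects (here, 'asOfTime'). Reading defensively and treating unknown
# shapes as cache misses avoids a flood of notices/fatals during a train.

CACHE_SCHEMA_VERSION = 2  # assumed versioning scheme, bumped on shape changes


def load_entry(raw: str):
    """Return a usable cache entry, or None to force regeneration."""
    try:
        entry = json.loads(raw)
    except ValueError:
        return None  # corrupt entry: regenerate instead of crashing

    if entry.get("schema") != CACHE_SCHEMA_VERSION:
        return None  # written by another deploy version: treat as a cache miss

    # Missing optional fields get a default instead of raising KeyError,
    # the Python analogue of PHP's "Undefined index" notice.
    entry.setdefault("asOfTime", time.time())
    return entry


# An entry that predates the new field still loads, with a defaulted asOfTime.
print(load_entry('{"schema": 2, "value": 42}'))
```

Treating mismatched entries as misses also explains why no purge was needed: the stale .1 entries simply age out or get regenerated on read.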
[07:46:39] Heads up folks I am planning to depool codfw this morning to complete OS upgrades on our core routers there
[07:46:57] If someone could +1 the patch to do that I'd appreciate it: https://gerrit.wikimedia.org/r/c/operations/dns/+/831800
[08:13:34] I'm gonna self-merge that CR as I need to keep on schedule.
[08:14:07] or sorry Jaime you'd already done so :)
[08:14:09] thanks!
[08:14:26] * jynus cries because he is no one
[08:15:09] lol
[08:33:52] topranks: what's the failure scenario here?
[08:34:05] we might need to depool MW too depending on that
[08:34:18] I don't think we need to do that.
[08:34:41] Each router is being done 1 at a time, internal transport links and external transit are all drained at protocol level before reboot.
[08:35:01] I'll see that the box isn't doing any traffic before anything disruptive happens
[08:35:10] ack
[08:35:28] mentioning it as we're not yet used to reason about MW when depooling codfw ;)
[08:35:37] The de-pool is sort of a precaution, however we are doing eqiad next week, so we need to validate the router-level changes and doing them 1-by-1 is ok. We've done all the other sites now and are confident enough.
[08:36:04] +1
[08:36:08] Yeah it's a consideration actually. But not needed in this case I think. Cheers.
[09:54:30] should I worry about ganeti exporter failures?
[09:55:07] https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1
[09:57:34] mmm I'd check the status of the exporters on the nodes; if there is a problem it should be in their logs (in theory)
[09:57:57] as FYI I migrated kafka-logging2002 to PKI-based TLS certs
[09:58:13] I am monitoring the status of the cluster etc.. in case of trouble, it is sufficient to revert my last patch
[10:12:22] kafka-logging2002 seems fine now, so we have 2/3 of the cluster using PKI certs. I didn't see any variation in traffic/consumers/etc.., but if anybody reports a consumer that doesn't work lemme know
[10:12:32] (or ping the observability team)
[10:15:03] nice!
[10:35:44] I see 2 errors on ganeti: Certificate for ganeti01.svc.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now
[10:36:03] and then TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
[10:36:38] either something happened with certs or some other deploy may have happened yesterday at 20:17:30
[10:39:35] also that daemon needs some change, because it says "active (running)" after getting a fatal exception
[10:42:34] slyngs: ^ the TypeError sounds like a bug in the exporter, could you have a look please?
[10:42:57] weird, I don't see any change on either puppet or apt at that time
[10:43:42] if you have any other pointers for change management monitoring, I accept suggestions
[10:46:30] the exporter has already been deployed for a few weeks, but I suspect partial availability in codfw exposed some bugs in it
[10:47:00] still, weird that it is happening now without a change on cert or software
[10:47:35] I will report it on T311288
[10:47:35] T311288: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288
[10:47:53] I don't think it is an urgent issue
[10:50:42] the warning has been happening for a long time, so that is unrelated
[10:52:07] this, on the other hand, looks not great: "smartd[1122]: Device: /dev/sda [SAT], 1496 Offline uncorrectable sectors"
[10:55:05] I think the error is a bug due to traffic/parameters change, will file it in that ticket
[11:02:00] https://phabricator.wikimedia.org/T311288#8231960
[11:02:18] TIL how to escape ` character on phabricator
[11:42:34] jynus: I'll take a look, but yeah, kinda weird without a cert change...
[11:42:58] see the comment, the warning is old and unrelated to the last issue
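The ganeti exporter traceback ("unsupported operand type(s) for +: 'int' and 'NoneType'") is the classic symptom of summing per-node capacity figures when some nodes report no value, which partial availability in codfw would explain. This is a guess at the failure mode rather than the actual exporter code from T311288; the function and field names below are made up:

```python
from typing import Iterable, Optional


def total_free_memory(per_node_mfree: Iterable[Optional[int]]) -> int:
    """Sum per-node free memory (MB), skipping nodes that did not report a value."""
    # A plain sum() over values containing None raises exactly the TypeError
    # seen in the logs; filtering the Nones out keeps the metric emitting.
    return sum(v for v in per_node_mfree if v is not None)


# Example: one drained/offline node reports None instead of an integer.
print(total_free_memory([12288, None, 8192]))  # -> 20480
```

Wrapping the collection loop so that an exception increments a scrape-failure metric, instead of killing the worker while systemd still reports "active (running)", would address the other observation above.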
[12:36:59] phab down?
[12:37:16] works for me
[12:37:33] It is timing out for me....
[12:37:52] WFM
[12:37:55] Works for me too,
[12:38:00] marostegui: do the wikis work for you?
[12:38:01] :_(
[12:38:07] marostegui: traceroute?
[12:38:09] cdanis: yes, and orchestrator too
[12:38:15] or 5XX?
[12:38:16] interesting :)
[12:38:34] it could be a network issue, then
[12:38:42] it is now back
[12:38:58] I got stuck at 8 ft.mad05.atlas.cogentco.com (130.117.14.202) 31.948 ms 41.776 ms 31.611 ms apparently
[12:38:59] let's check NEL to see if widespread
[12:39:08] phab goes through cp-text
[12:40:19] I see an increase in timeouts from ES country
[12:40:43] marostegui: please tell me your ISP in private
[12:41:11] done
[12:42:05] main issues from yoigo and vodafone, so maybe ISP-wide issue?
[12:42:17] I will compare with other countries
[12:45:44] france doesn't have the same spike, so probably not dc-specific
[12:46:35] you hit drmrs, right, marostegui?
[12:46:48] yeah
[12:52:42] looks to me like a regional connectivity issue: https://logstash.wikimedia.org/goto/d782bb5aa98e535a6147107a3fd5911f
[12:53:39] lots of orange and telefonica de españa reports, mostly
[12:53:57] that's me yep
[15:02:13] topranks: sorry to bother you- in theory, the alert that went off earlier recovered already, right?
[15:02:38] I think it's stuck on victorops and want to confirm before manually resolving it
[15:04:15] jynus: no probs yes it's 100% recovered
[15:04:41] possibly because I downtimed the host after it fired, the recovery never propagated to victorops
[15:04:46] thanks, I was told that sometimes we have to solve them manually and wanted to be sure
[15:04:49] But you are ok to resolve it
[15:04:55] yeah, that is my understanding too
[15:04:55] cool thanks :)
[15:04:59] thanks to you!
[15:41:55] Is it possible to use debdeploy to downgrade packages?
[15:43:08] is the downgraded version in apt.w.o? or do you have to install from the local cache?
[15:43:16] cc moritzm for authoritative answer :)
[15:43:40] Ah no, you're right. I removed the downgraded version from apt.w.o
[16:04:32] today has been a "fun" day (I think you would agree :-()
[16:05:02] hopefully your turn will be more boring
[16:05:30] details at https://docs.google.com/document/d/1YfRdiL2D0iCkkEnb12iq5IXbq8jAseMhhcXMu1G6LSA cheers
[16:05:37] I am writing my own incident report now and we can discuss it in the next onfire review. Situation: largely resolved.
[16:06:36] btullis: feel free to contact onfire to schedule it, and don't forget to invite more people from your team, too!
[16:06:59] jynus: Thanks. Will do.
[16:26:13] btullis: yeah, the old packages are still in the local apt cache, so a downgrade will work fine
[16:27:43] moritzm: Thanks.
[18:01:56] btullis: not sure if known, as I can't find anything in phabricator, but FYI the MegaRAID Icinga alert for an-worker1079 has been flapping since Sep. 3rd AFAICT
[18:06:31] Thanks. I tried giving it 3 months of downtime earlier today, but it didn't stick. Sorry, it's really annoying and we have a bunch of these servers all failing with RAID batteries at the same time. If you could downtime it for me please, I'd be grateful.
[18:06:58] It's about to be decommed.
[18:09:25] btullis: I don't see any downtime nor a SAL !log for the downtime cookbook for it, where/how did you downtime it?
[18:25:45] Just in Icinga.
[18:42:16] btullis: it should be done, lmk if it doesn't work for some reason:
[18:42:16] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=an-worker1079&service=MegaRAID
[18:51:48] Thanks ever so much.
[19:13:36] np
[20:04:27] <_joe_> that doesn't seem right ^^
[20:04:47] <_joe_> cwhite: can you check maybe?
[20:08:41] hmm
[20:11:19] _joe_: how often does sirenbot recheck?
[20:13:39] there we go
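A footnote on the an-worker1079 MegaRAID exchange earlier: downtimes set in the classic Icinga web UI are submitted through Icinga's external command file, which automation can also write to directly. The sketch below is only a generic illustration of that mechanism, assuming the stock command-file path; it is not how the downtime cookbook mentioned above is implemented, and that cookbook (which also leaves a SAL !log) is the normal route.

```python
import time

ICINGA_CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumed path (Debian default)


def downtime_service(host: str, service: str, hours: float,
                     author: str, comment: str) -> None:
    """Schedule a fixed service downtime via the standard Icinga 1.x external command."""
    now = int(time.time())
    end = now + int(hours * 3600)
    # SCHEDULE_SVC_DOWNTIME;<host>;<service>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
    cmd = (f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
           f"{now};{end};1;0;{end - now};{author};{comment}\n")
    with open(ICINGA_CMD_FILE, "w") as f:
        f.write(cmd)


# Roughly the "3 months of downtime" asked for above (values illustrative only).
downtime_service("an-worker1079", "MegaRAID", hours=24 * 90,
                 author="btullis", comment="Failed RAID battery, host pending decom")
```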