[01:06:49] verbose write-up of root cause analysis regarding the memcached traffic doubling (rzl, thcipriani) - https://phabricator.wikimedia.org/T310532#8004500
[01:48:07] wow, nice clear writeup Krinkle, well done!
[02:15:12] ^^ seconded
[03:10:59] yeah, thanks for that work on top of the investigation! much appreciated
[07:32:43] even I can understand that, nicely done :)
[07:59:06] * volans lost his IRC bouncer during the night. If there was any direct ping to me please ping me again ;)
[08:04:01] * jbond also lost their bouncer
[08:11:25] so you both missed the extensive overnight discussion on how to completely redesign spicerack and move it to golang. too late I guess :-)
[08:18:42] great, so you're taking over I gather... all yours :-P
[08:19:03] hehehe :P
[08:35:48] At least you missed the bit where we discussed getting an intern to port all of Puppet to cfengine as a six-week project
[08:47:23] * jbond shudders with memories of running cfengine1 on FreeBSD 4
[08:53:29] There are fates worse than "no automation at all"
[08:55:25] :D
[10:27:19] Krinkle: I was wondering if we can add es1, es2, es3, es4 and es5 to https://noc.wikimedia.org/db.php somehow
[11:32:18] probably ;P
[11:33:24] It looks like it mostly focuses on sections... so it probably just needs a bit of handling for the external store stuff
[11:34:12] it would be useful to have it there if it is not lots of work
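(Aside on the db.php request above: db.php lives in operations-mediawiki-config at docroot/noc/db.php and renders the data from https://noc.wikimedia.org/dbconfig/eqiad.json, both linked later in this log. Below is a rough Python sketch of the shape of the change, not the real implementation; the actual page is PHP, and the key names "sectionLoads" and "externalLoads" are assumptions about the dbconfig JSON layout rather than verified facts.)

```python
# Rough sketch only (editor's illustration): the real page is docroot/noc/db.php, in PHP.
# The key names "sectionLoads" and "externalLoads" are assumed, not verified.
import json
from urllib.request import urlopen

with urlopen("https://noc.wikimedia.org/dbconfig/eqiad.json") as resp:
    dbconfig = json.load(resp)

# Core sections (s1..s8, etc.) as db.php already lists them.
sections = dict(dbconfig.get("sectionLoads", {}))

# The requested addition: fold in the external store sections (es1..es5) so they
# show up on https://noc.wikimedia.org/db.php alongside the core sections.
sections.update(dbconfig.get("externalLoads", {}))

for name in sorted(sections):
    print(name)
```

In db.php itself the equivalent change would presumably be merging the external-store section list into whatever loop renders the per-section output.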
[12:38:11] Hello hello! We're going to start the Netbox upgrade now, please refrain from using it either directly or via cookbook (makevm, decom, provision, etc). If you have any doubt, feel free to ask
[12:46:57] good luck :)
[13:44:26] XioNoX: o/ just to double check, reimages don't touch any netbox data right?
[13:44:36] I am relatively sure but better to double check
[13:45:18] elukey: actually they run the puppetdb script at the end
[13:45:35] netbox is almost in a good state, if you could wait a bit more we should be able to give the all clear soon
[13:45:56] volans: only because it is you
[13:45:58] :)
[13:46:03] <3
[14:11:28] elukey: you can go now with the reimage, please be our beta-tester
[14:11:29] :D
[14:15:12] started!
[14:42:52] elukey: all good with the reimage?
[14:44:36] volans: yep, it is about to finish afaics
[14:44:57] ok, waiting for the last bit, that is the netbox interaction :D
[14:45:00] fingers crossed
[14:51:07] volans: I have a mid-process reimage from this morning on ms-be1059 waiting for me to tell it the system has managed to boot into a vanilla Deb11. Since I've _finally_ managed to get the installer to work, am I OK to restart this reimage?
[14:52:35] volans: I think it may fail to validate the puppet run since I have an issue with cassandra
[14:54:29] elukey: ack
[14:55:15] Emperor: yes, go ahead
[14:56:19] thanks :)
[14:57:56] jbond, volans and I are happy to announce that Netbox has been successfully upgraded. You can now resume using it, as well as the cookbooks.
[14:59:49] nicely done XioNoX jbond volans!
[15:12:00] XioNoX: congrats!
[15:13:43] congrats to everyone involved, obviously :)
[15:13:48] super excited to see this
[15:29:05] XioNoX, jbond, volans: awesome work, very exciting!
[15:41:16] the new netbox is far easier on the eyeballs, i said this during its use on netbox-next
[15:41:18] but worth repeating
[15:41:29] phabricator upgrade in progress
[15:41:39] yeah, i went to use it and couldn't, saw the window =]
[15:41:51] * robh just reads email and drinks coffee for the next 20 min
[15:42:12] as a note for the future: during planned maintenances it might be nice if we served an error page other than the default envoy one
[15:42:16] 50% of what I do requires netbox, 100% requires phab, heh
[15:42:28] seeing the default error page rather than a maintenance page is concerning for a moment
[15:42:33] cdanis: +1
[15:42:49] would be much nicer to have a default maint page that points to the wikitech schedule or something
[15:43:10] heh, our error pages used to just say to give us more money ;D
[15:43:21] we have too many layers emitting different error pages now :)
[15:43:40] arguably we shouldn't ever emit a default envoy error page, or one not generated at the edge?
[15:43:43] true, it used to be that nothing was cached but wikiprojects, so everything could be custom
[15:43:55] "upstream connect error or disconnect/reset before headers. reset reason: connection failure" is concerning if you don't know
[15:44:13] but anytime i see an error my first reaction is to check the deployment schedule
[15:51:19] cdanis: there used to be one
[16:25:39] Looks like we had a spike of requests to Thanos' range query API, https://[10.2.2.53]:443/-/ready
[16:26:09] which caused it to fail network probes, does anyone know where to find logs of thanos' queries during that time period?
[16:33:15] RhinosF1: phab maintenance window closed
[16:33:41] well done mutante and others
[16:42:01] indeed! nice job mutante and everyone else
[17:01:27] I could use a confirmation from someone that the following is an accurate summary of https://wikitech.wikimedia.org/wiki/Incidents/2022-05-09_exim-bdat-errors
[17:01:28] > During five days, about 14,000 incoming emails from Gmail users to wikimedia.org were rejected and returned to sender.
[17:01:45] in particular the to/from, and the fact that the failure was not silent to the sender
[17:06:14] marostegui: can you file a task for that idea? (/me created a #noc.wikimedia.org tag just now)
[17:13:46] Krinkle: I can help, can you expand on your question?
[17:15:28] this is the code that generates that page: https://github.com/wikimedia/operations-mediawiki-config/blob/e697f0365f08748fff82415ae9381ec6fadf32e6/docroot/noc/db.php
[17:15:36] this is the data it reads from https://noc.wikimedia.org/dbconfig/eqiad.json
[17:16:05] jhathaway: I'd like to make sure I've correctly inferred from the gdoc and task that 1) this is indeed about incoming email, not outgoing, and 2) incoming from gmail.com only (and other google mail), and 3) that it wasn't a silent failure, e.g. no mails lost without the sender knowing.
[17:16:07] somewhere es1 through es5 should be added
[17:16:44] Krinkle: I believe all of those statements are true
[17:17:34] with the caveat that how an individual email provider handles 5XX errors is unspecified, to my knowledge
[17:17:52] gmail presents a nice bounce message, but I didn't survey other providers
[17:17:57] ack, I imagine in some cases they might retry later before giving up and telling the sender?
[17:18:14] they shouldn't, since a 5XX error is a hard error
[17:18:15] the mention of 503 confuses me as I'm not used to seeing that in the context of... email.
[17:18:36] but maybe I wrongly autofixed that from 503 to HTTP 503
[17:19:05] it's an SMTP error code too :)
[17:19:42] in this case Exim is trying to tell the client that they are sending invalid SMTP error codes
[17:19:56] ah, codes, not error codes
[17:19:56] "503 valid RCPT command must precede DATA" in email
[17:20:25] ack, thanks. I've uncorrected my edit
[17:20:34] "503 Bad sequence of commands" <- from rfc2821
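(Aside on the 503 discussed above: it is an SMTP reply code, "bad sequence of commands" per RFC 2821/5321, not only an HTTP status. Below is a minimal Python sketch of how it surfaces on the wire, using smtplib; mx.example.org is a placeholder rather than a real host, and this should only ever be pointed at a test MTA.)

```python
# Minimal sketch (editor's illustration): provoke SMTP 503 "bad sequence of commands"
# by skipping RCPT TO before DATA. "mx.example.org" is a placeholder; use a test MTA.
import smtplib

with smtplib.SMTP("mx.example.org", 25, timeout=10) as smtp:
    smtp.ehlo()
    code, msg = smtp.docmd("MAIL", "FROM:<sender@example.org>")
    print(code, msg)  # 250 if the envelope sender is accepted

    # Jump straight to DATA without any RCPT TO:
    code, msg = smtp.docmd("DATA")
    print(code, msg)  # a conforming MTA answers 503, e.g. Exim's
                      # "valid RCPT command must precede DATA"
```

As noted above, a 5XX is a hard error, which is why Gmail surfaced these rejections to senders as bounces rather than retrying.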
[17:22:47] Krinkle: anything else I can help clarify?
[17:23:16] jhathaway: perhaps, there's one other incident where I'm not sure of the impact: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-24_Failed_Apache_restart
[17:24:04] I'm trying to determine whether wiki users were noticeably impacted. it seems at least a portion of mw servers got affected, but given the severity (presumably 50x errors?) I'm guessing they were quickly depooled and/or retried at the edge
[17:24:35] by quickly I mean automatically, by pybal for the appserver lvs
[17:24:50] I believe some portion of logged-in users were affected, but I'm not sure if we have a ballpark percentage
[17:29:46] not seeing any obvious dips at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=1653390859763&to=1653396547151&orgId=1
[17:30:12] there's a slight dip in HTTP 200s but hard to be sure
[17:30:32] Krinkle: I think there was very little impact to wiki users
[17:30:50] also looking at https://w.wiki/5Hp4 and https://w.wiki/5Hp5
[17:35:30] ack, I've now got:
[17:35:30] > For 35 minutes, numerous internal services that use Apache on the backend were down. This included Kibana (logstash) and Matomo (piwik). For 20 of those minutes, there was also reduced MediaWiki server capacity, but no measurable end-user impact for wiki traffic.
[18:05:12] Krinkle: it is my plan to try to classify outages at https://docs.google.com/spreadsheets/d/1EYbMt6xTCDBaWfrPgu8Z1a3CvYrxbH1uH4kVf8MOQfQ by type (outage, data loss, etc.) and impact (internal, partial outage, full outage, etc.), would that be helpful for you?