[01:35:06] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (aaron) Here are some stats from a custom script on mwdebug (per key "collection"). ` { "overall": { "num_slots": 4099,...
[09:07:21] serviceops, Continuous-Integration-Infrastructure, Release-Engineering-Team, SRE: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (hashar) In my experience it is better done during low CI traffic, start of morning in Dallas will work just fine. We would then send a...
[09:34:56] serviceops, SRE, Developer Productivity, Performance-Team (Radar), and 2 others: Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (hashar) Spotted this on labweb1002 / lab1001 today, all messages referred to `127.0...
[09:37:36] serviceops, SRE, Developer Productivity, Performance-Team (Radar), and 2 others: Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (hashar) I forgot, a reqid example: https://logstash.wikimedia.org/app/dashboard...
[09:53:33] serviceops, Performance-Team: Investigate performance degradation at high concurrencies in php-fpm - https://phabricator.wikimedia.org/T293630 (jijiki)
[11:17:51] kubestagetcd host seems to fail PCC, does it ring a bell / is known?
[11:18:03] Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret ssl/_etcd-server-ssl._tcp.k8s3-staging.codfw.wmnet.key (file: /srv/jenkins-workspace/puppet-compiler/31918/production/src/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /srv/jenkins-workspace/puppet-compiler/31918/production/src/modules/profile/manifests/etcd/v3.pp, line: 109) on node
[11:18:09] kubestagetcd2002.codfw.wmnet
[11:38:01] serviceops, Community-Tech, SRE, wikidiff2, Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (ldelench_wmf)
[16:33:16] serviceops, Citoid, VisualEditor, WMSE-Bug-Reporting-and-Translation-2021, and 2 others: Automatic citation generation using ISBN on Wikipedia doesn't work - https://phabricator.wikimedia.org/T294010 (JosefineHellrothLarssonWMSE) Unfortunately, this issue seems to remain unsolved or the bug has r...
[16:41:21] serviceops, Citoid, VisualEditor, WMSE-Bug-Reporting-and-Translation-2021, and 2 others: Automatic citation generation using ISBN on Wikipedia doesn't work - https://phabricator.wikimedia.org/T294010 (Joe) If the problem was solved and has come back, I would imagine we get banned from their syste...
[16:47:59] serviceops, Citoid, VisualEditor, WMSE-Bug-Reporting-and-Translation-2021, and 2 others: Automatic citation generation using ISBN on Wikipedia doesn't work - https://phabricator.wikimedia.org/T294010 (matmarex) Resolved→Open
[17:07:20] serviceops, Performance-Team: Investigate performance degradation at high concurrencies in php-fpm - https://phabricator.wikimedia.org/T293630 (Krinkle) >>! In T280497#7460370, @aaron wrote: > I have some scripts in my home dir on mwdebug1001.eqiad.wmnet ([..] apcu_rw_test.php).
>
[18:12:56] serviceops, Continuous-Integration-Infrastructure, Release-Engineering-Team, SRE: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (Dzahn) >>! In T294271#7460829, @hashar wrote: >start of morning in Dallas will work just fine. Cool, thanks! So, @Papaul maybe you wa...
[18:42:13] serviceops, SRE, Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm) a:Legoktm
[20:55:14] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki)
[20:55:48] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) Open→Resolved **Production URL **testing (1.929.416 URLs) results in https://people.wikimedia.org/~akosiaris/prod_urls...
[21:47:33] serviceops, Continuous-Integration-Infrastructure, Release-Engineering-Team, SRE: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (Papaul) @Dzahn Next week Monday 1st at 9:30 am CT
[21:59:10] serviceops, Continuous-Integration-Infrastructure, Release-Engineering-Team, SRE: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (hashar) It is a holiday here in France (All-saints), then I am not critical to the DRAC upgrade ;) I will make arrangement, it will b...
[22:00:44] serviceops, Continuous-Integration-Infrastructure, Release-Engineering-Team, SRE: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (Dzahn) @hashar I am wondering if you need me around (for mgmt access / root / +2 ). I have a request to be off that day but it's not sur...
[22:02:39] mutante: if things go south like the server dying, surely we would need someone able to restore from backup
[22:03:03] not sure whether All Saints holiday is widely observed
[22:06:17] anyway I should be sleeping really ;)
[22:10:21] hashar: I don't think it's worth the risk to touch the main CI server anymore, tbh
[22:10:36] I pushed for the hw replacement
[22:11:08] we can just not do this and ACK the alert and focus on replacing the whole box
[22:11:56] or do the whole "switch back to eqiad" ticket first and schedule that independently (if you think it's needed and worth it)
[22:12:12] and after that we get back to ops-codfw and tell them they can just do it
[22:24:41] serviceops, Citoid, VisualEditor, WMSE-Bug-Reporting-and-Translation-2021, and 2 others: Automatic citation generation using ISBN on Wikipedia doesn't work - https://phabricator.wikimedia.org/T294010 (Mvolz) >>! In T294010#7462059, @Joe wrote: > If the problem was solved and has come back, I woul...
[22:31:25] serviceops, Continuous-Integration-Infrastructure, Release-Engineering-Team, SRE: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (Dzahn) After re-thinking this and chatting some more on IRC I now think we should not do this and close my own request as invalid. It's...
[22:33:44] serviceops, Release-Engineering-Team: contint hardware refresh? - https://phabricator.wikimedia.org/T294276 (Dzahn)
[22:36:07] serviceops, Release-Engineering-Team: contint hardware refresh? - https://phabricator.wikimedia.org/T294276 (Dzahn) If we order new hardware here this can also be combined with switching the main server back to eqiad (T256422) (or not). Not directly related though except it might be useful for bringing u...
[22:39:15] serviceops, Continuous-Integration-Infrastructure, Release-Engineering-Team, SRE: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (Dzahn) Open→Declined Suggesting to do this once T256422 is resolved or T294276 or CI does not run on contint* servers anymore, w...
[22:39:56] serviceops, Continuous-Integration-Infrastructure, Release-Engineering-Team, SRE: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (Dzahn) Be bold and reopen if you really think otherwise.
[22:44:45] serviceops, Continuous-Integration-Infrastructure, Release-Engineering-Team, SRE: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (Dzahn) P.S. The actual "contint2001.mgmt" alert in Icinga is actually quite some time ago.. not worth it. but there are other alerts (IP...