[00:00:58] SRE, serviceops-radar, Release Pipeline (Blubber): build and import blubber package for buster and bullseye (which supports v4) - https://phabricator.wikimedia.org/T283891 (Dzahn)
[00:01:56] SRE, serviceops-radar, Release Pipeline (Blubber): build and import blubber package for buster and bullseye (which supports v4) - https://phabricator.wikimedia.org/T283891 (Dzahn) unless blubber is supposed to be just used via blubberoid. Then feel free to decline but let's update the wikitech tutori...
[00:02:18] PROBLEM - Check systemd state on thanos-be1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:32] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:32] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:38] PROBLEM - Check systemd state on thanos-be1004 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:42] PROBLEM - Check systemd state on thanos-be2004 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:48] PROBLEM - Check systemd state on thanos-be1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:30] PROBLEM - Check systemd state on thanos-be2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:40] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['phab1004.eqiad.wmnet'] ` and were **ALL** successful.
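The check_systemd_state alerts above (and the further ones below) all point at the same runbook, and the triage is plain systemd tooling: find the failed unit, fix whatever made it exit non-zero, then let it run cleanly so the host leaves the "degraded" state. A minimal sketch, assuming nothing beyond stock systemctl/journalctl; the host and unit names are taken from the alerts themselves:

    # on an affected host, e.g. thanos-be1001
    systemctl --failed                            # confirm logrotate.service is the failed unit
    sudo journalctl -u logrotate.service -n 100   # see why the last run failed
    sudo systemctl start logrotate.service        # re-run the oneshot once the cause is fixed
    systemctl is-system-running                   # "running" again means the Icinga check can recover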
[00:05:46] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:32] PROBLEM - Check systemd state on thanos-fe2003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:46] PROBLEM - Check systemd state on thanos-be2003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:10] PROBLEM - Check systemd state on thanos-fe1003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:46] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/696724
[00:27:16] PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:19] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (RobH) Open→Resolved
[00:30:51] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (RobH)
[00:37:30] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:32] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 257371864 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:58] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:14] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 200016 and 63 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:37:40] (PS1) Jforrester: ExtensionDistributor: REL1_36 is now the stable release [mediawiki-config] - https://gerrit.wikimedia.org/r/696736 (https://phabricator.wikimedia.org/T279455)
[01:40:06] (CR) Jforrester: [C: +2] ExtensionDistributor: REL1_36 is now the stable release [mediawiki-config] - https://gerrit.wikimedia.org/r/696736 (https://phabricator.wikimedia.org/T279455) (owner: Jforrester)
[01:40:54] (Merged) jenkins-bot: ExtensionDistributor: REL1_36 is now the stable release [mediawiki-config] - https://gerrit.wikimedia.org/r/696736 (https://phabricator.wikimedia.org/T279455) (owner: Jforrester)
[01:43:46] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:696736|ExtensionDistributor: REL1_36 is now the stable release (T279455)]] (duration: 00m 57s)
[01:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:50] T279455: Release MW 1.36.0 - https://phabricator.wikimedia.org/T279455
[01:44:09] Hurrah.
[01:46:39] ^
[01:47:03] thcipriani: I also did a draft re-write of https://www.mediawiki.org/wiki/Release_checklist
[01:47:08] \o/
[01:47:28] And on that note, I'm off for the long weekend. :-)
[01:47:30] that is already much simpler
[01:47:36] thanks for that
[01:47:41] thcipriani: Happy to help.
[01:47:42] enjoy your long weekend
[01:47:46] And you!
[01:48:00] to quote a chad I know: I don't sleep, I wait
[01:48:05] * James_F laughs.
[01:48:08] Yes, well.
[01:48:19] :D
[01:48:24] No doubt people will find bugs in 1.36.0, but at least it's finally out.
[01:50:11] we don't write perfect software!?
[01:50:18] * thcipriani shocked
[01:50:44] we can't even setup the branches properly
[01:50:45] * Reedy grins
[01:50:54] We write software with a disclaimer of any warranty, no less.
[01:51:21] :)
[02:28:33] (CR) Chico Venancio: [C: +1] "This is the last one, right? The last one is live now. Thanks for doing this @Reedy" [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[02:33:12] (PS3) Reedy: Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886)
[02:38:32] (PS4) Reedy: Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886)
[02:39:38] (CR) jerkins-bot: [V: -1] Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[02:41:48] (PS5) Reedy: Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886)
[02:42:58] (CR) jerkins-bot: [V: -1] Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[02:44:48] (PS6) Reedy: Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886)
[03:18:10] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:37] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:06] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:34:36] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:35:20] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210528T0700)
[07:05:14] SRE, Wikimedia-Hackathon-2021, Wikimedia-Mailing-lists, Upstream: Add OAuth login to mailman for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (Tgr) Upstream patch is [[https://github.com/pennersr/django-allauth/pull/2873|pennersr/django-allauth#2873]].
[07:49:18] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1011.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[07:49:25] PROBLEM - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[07:49:32] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:49:52] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:49:58] * volans here
[07:50:00] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1011.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[07:50:10] acked the page
[07:50:14] I can turn my computer back on
[07:50:28] here as well
[07:51:07] RECOVERY - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[07:51:08] legoktm: go to bed
[07:51:22] o/
[07:51:36] around as well
[07:51:56] volans: ack, ty :)
[07:52:08] I don't see much on https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=wdqs-internal&var-instance=All&var-datasource=thanos&from=now-1h&to=now
[07:52:21] https://config-master.wikimedia.org/pybal/eqiad/wdqs-internal
[07:52:29] there are three hosts in the pool, one is not pooled
[07:52:50] maybe it was depooled to recover from lag
[07:53:17] I see some wdqs transfers yesterday on SAL, but none referring 1011
[07:53:46] the one depooled is 1003
[07:54:03] 1011 seems to be failing for some reason (maybe high traffic?), at least this is my understanding
[07:54:21] its loadavg is very low
[07:54:24] ah yes https://sal.toolforge.org/production?p=0&q=wdqs1003&d=
[07:54:26] I was having a look at logs
[07:54:31] it was depooled yesterday by Ryan
[07:54:41] gehel: my any chance around?
[07:55:04] or ryankemper, but ofc I expect you to be sleeping at this time
[07:55:26] elukey: ack thx
[07:56:13] 1003 might be not available for https://phabricator.wikimedia.org/T280382#7114272
[07:56:27] all GETs on wdqs1011 are 500 on nginx logs
[07:56:38] I suspect blazegraph is not happy
[07:56:49] PROBLEM - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[07:57:21] ah recovery + page
[07:57:37] ERROR c.b.r.sail.webapp.BigdataRDFServlet - cause=java.util.concurrent.TimeoutException, query=SPARQL-QUERY: queryStr=#wbqc
[07:57:38] * volans was looking at https://wikitech.wikimedia.org/wiki/Service_restarts#Wikidata_Query_Service_(WDQS)
[07:57:40] etc..
[07:57:47] so I'd restart blazegraph on 1011
[07:58:22] if you know it can be restarted alone without other coordination go ahead, do we need to depool it first?
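An aside on the pool check a few lines up (the config-master URL and "there are three hosts in the pool, one is not pooled"): the same pooled state can also be read with conftool directly. A rough sketch; the confctl syntax mirrors the conftool action logged later at 08:06:52, and the host selector below is only an example from this incident:

    # what pybal serves for this pool
    curl -s https://config-master.wikimedia.org/pybal/eqiad/wdqs-internal
    # the underlying conftool object for one backend (selector format as in the SAL entry at 08:06:52)
    sudo confctl select 'name=wdqs1011.eqiad.wmnet' get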
[07:58:29] not that is working ofc
[07:58:30] Acked the new page
[07:58:31] RECOVERY - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[07:58:34] thanks sobanski
[07:58:39] dcausse: are you around?
[07:58:57] I am not near a pc 0
[07:59:10] even 1008 is not working great
[07:59:22] <_joe_> has anyone checked pybal?
[07:59:31] volans: It has been a while since I made a wdqs restart, but back then they needed to be depooled
[07:59:37] elukey: according to https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Blazegraph_deadlock
[07:59:46] <_joe_> moritzm: confirm they do
[07:59:48] The short-term steps are to restart blazegraph on the affected node(s). This will unstick the process, but it can become deadlocked again from subsequent queries, thus why it is necessary to identify the offender(s)
[07:59:58] _joe_ not yet, but there are three services in the pool, one pooled=false (due to reimage etc..) and other two under some blazegraph issue
[08:00:26] volans: I agree yes
[08:00:28] <_joe_> I see also wdqs1004/1003 in pybal that are down
[08:00:39] although it doesn't seem a deadlock according to https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m&var-cluster_name=wdqs
[08:00:44] <_joe_> so pybal sees 1003,1004 and 1011 down
[08:00:50] the docs says if null values are rported is deadlock
[08:01:12] <_joe_> let's restart one of them at least
[08:01:15] +1
[08:01:23] shall I proceed with 1011?
[08:01:24] <_joe_> elukey: are you restarting 1011?
[08:01:26] <_joe_> yes go on
[08:01:31] sure
[08:01:54] <_joe_> these machines are in many different pools, btw
[08:02:02] PROBLEM - Query Service HTTP Port on wdqs1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[08:02:08] elukey: +1
[08:02:11] !log restart blazegraph on wdqs1011
[08:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:24] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:02:25] <_joe_> May 28 08:02:16 lvs1015 pybal[24265]: [wdqs-internal_80] INFO: Leaving previously pooled but down server wdqs1011.eqiad.wmnet pooled
[08:02:44] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:02:50] <_joe_> ok let's go with 1004 I'd say
[08:02:59] <_joe_> I'll do it if no one objects
[08:03:16] <_joe_> uh it's not actually a server anymore
[08:03:25] <_joe_> I think someone forgot to remove them from conftool
[08:03:32] the other one is 1008 afaics
[08:03:46] <_joe_> elukey, volans can you see wdqs1003/1004 anywhere
[08:03:50] RECOVERY - Query Service HTTP Port on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
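On the "restart blazegraph on wdqs1011" logged at 08:02:11 above: the runbook excerpt quoted at 07:59:48 only says to restart blazegraph on the affected node, and moritzm notes such restarts used to require depooling first. A hedged sketch of that step; the depool/pool wrappers and the wdqs-blazegraph unit name are assumptions here, not something this log confirms:

    # on wdqs1011 (sketch; wrapper and unit names assumed)
    sudo depool                              # take the host out of LVS rotation first
    sudo systemctl restart wdqs-blazegraph   # the "restart blazegraph" step from the runbook
    sudo systemctl status wdqs-blazegraph    # confirm it came back before repooling
    sudo pool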
[08:04:05] <_joe_> elukey: wait please, I think I know what's wrong
[08:04:10] https://config-master.wikimedia.org/pybal/eqiad/wdqs-internal
[08:04:15] yes yes I am not taking actions
[08:04:23] 1003 and 1008 are active in netbox
[08:04:28] 1003 is down for long maintenance
[08:04:28] <_joe_> volans: uhm
[08:04:42] there is a task (see above)
[08:04:44] 1004 is staged
[08:04:46] <_joe_> volans: not in the host keys for ssh anymore
[08:05:03] <_joe_> so I can explain what happens to -internal
[08:05:09] <_joe_> you have 3 servers in the pool
[08:05:14] <_joe_> 1 of which is always down
[08:05:23] <_joe_> it means we can't depool even 1 server from pybal
[08:05:28] :/
[08:05:31] <_joe_> the solution is to set 1003 to inactive
[08:05:58] but wdqs1003 was depooled yesterday already (=no, not =inactive)
[08:06:20] <_joe_> ok, so?
[08:06:24] 1008 is part of the pool right?
[08:06:27] <_joe_> I'm telling you why we got paged
[08:06:41] <_joe_> we got paged because we can't depool any server in that pool
[08:06:44] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[08:06:44] ah sure
[08:06:46] yes
[08:06:52] !log oblivian@cumin1001 conftool action : set/pooled=inactive; selector: name=wdqs1003.eqiad.wmnet,dc=eqiad
[08:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:12] going to add a note in https://phabricator.wikimedia.org/T280382 about --^
[08:07:23] <_joe_> I fear something similar is happening with 1004
[08:07:30] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[08:08:11] <_joe_> ok, no
[08:08:19] <_joe_> 1004 is just a nuisance in pybal logs
[08:08:32] <_joe_> ok, perfect, crisis averted I'd say
[08:08:41] should we restart also 1008?
[08:08:56] let's check logs first
[08:08:59] <_joe_> volans: why? it's working AFAICT
[08:09:13] right, I was checking now its logs
[08:09:14] looks good
[08:10:24] nginx looks good but I still see some timeouts and exceptions in blazegraph's logs (like the ones on 1011)
[08:10:49] but last one was some mins ago
[08:11:30] so probably good, then ryankemper and gehel can follow up when they are online (nothing seems on fire)
[08:11:55] <_joe_> ok, I GTG, but we really need to point people again to the docs about conftool state and pybal state
[08:12:31] makes sense
[08:12:48] I also blame volans as last step of this incident resolution
[08:13:02] (it seems a good thing to do)
[08:13:03] :)
[08:13:05] lol
[08:13:09] happy rest of the day folks!
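To restate _joe_'s diagnosis above in conftool terms: wdqs1003 had only been set pooled=no, so it still counted against pybal's depool threshold for the three-host wdqs-internal pool, and pybal therefore refused to drop the failing wdqs1011 as well (hence the "Leaving previously pooled but down server ... pooled" line at 08:02:25). Setting the host to inactive removes it from the pool entirely. The sketch below is simply the confctl form of the conftool action logged at 08:06:52; the threshold value itself is configuration not shown in this log:

    # same action as the SAL entry at 08:06:52
    sudo confctl select 'name=wdqs1003.eqiad.wmnet' set/pooled=inactive
    # later, once the host is genuinely back in service
    sudo confctl select 'name=wdqs1003.eqiad.wmnet' set/pooled=yes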
[09:24:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:26:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:56:42] (CR) MarcoAurelio: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/696535 (https://phabricator.wikimedia.org/T283380) (owner: MarcoAurelio)
[10:34:34] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:40:16] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:41:00] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:12:05] SRE, Okapi [Wikimedia Enterprise], Traffic: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (Eugene.chernov) Thank you @BBlack. I've sent you the email to confirm the act on Route53 verification.
[14:01:37] (CR) Nikki Nikkhoui: "Tested locally in Minikube with no glaring issues when looking at logs !" [deployment-charts] - https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: Nikki Nikkhoui)
[14:28:02] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:29:46] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:43:03] (PS1) Andrew-WMDE: admin: Remove previous SSH key for Andrew Kostka [puppet] - https://gerrit.wikimedia.org/r/697063 (https://phabricator.wikimedia.org/T283940)
[14:51:24] SRE, SRE-Access-Requests: Enroll Andrew Kostka’s YubiKey for production access - https://phabricator.wikimedia.org/T283355 (Andrew-WMDE) @Marostegui Thanks! It looks it working so I went ahead and removed the old key in T283940.
[15:09:03] (PS9) Superyetkin: Enable ULS webfonts by default on trwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626)
[16:00:22] (CR) Lucas Werkmeister (WMDE): [C: +1] "+1 for CI, but I also left a comment on the task that should be resolved first, I think." [mediawiki-config] - https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: Superyetkin)
[16:31:50] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:25] (PS1) Sahilgrewalhere: selenium: Upgrade WebdriverIO to v7 [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579)
[20:14:51] ACKNOWLEDGEMENT - Check systemd state on thanos-be1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:51] ACKNOWLEDGEMENT - Check systemd state on thanos-be1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:51] ACKNOWLEDGEMENT - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:51] ACKNOWLEDGEMENT - Check systemd state on thanos-be1004 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:51] ACKNOWLEDGEMENT - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:52] ACKNOWLEDGEMENT - Check systemd state on thanos-be2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:52] ACKNOWLEDGEMENT - Check systemd state on thanos-be2003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:53] ACKNOWLEDGEMENT - Check systemd state on thanos-be2004 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:53] ACKNOWLEDGEMENT - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:54] ACKNOWLEDGEMENT - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:54] ACKNOWLEDGEMENT - Check systemd state on thanos-fe1003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:55] ACKNOWLEDGEMENT - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service cole_white T283951 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state