[00:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[00:10:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:15:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[00:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639198 (phaultfinder)
[00:38:26] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1128021
[00:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1128021 (owner: TrainBranchBot)
[00:42:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:50:03] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1128021 (owner: TrainBranchBot)
[01:08:33] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1128022
[01:08:33] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1128022 (owner: TrainBranchBot)
[01:09:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639206 (phaultfinder)
[01:26:35] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1128022 (owner: TrainBranchBot)
[02:24:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639215 (phaultfinder)
[02:26:27] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1952345016 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:27:27] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 34040 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:30:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:35:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:09:15] PROBLEM - snapshot of s4 in codfw on backupmon1001 is CRITICAL: snapshot for s4 at codfw (db2239) taken more than 3 days ago: Most recent backup 2025-03-13 02:37:57 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:19:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639259 (phaultfinder)
[03:39:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639281 (phaultfinder)
[04:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[04:42:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:44:13] SRE, SRE-swift-storage, Upstream: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10639347 (Cyberdog958) >>! In T383053#10638929, @DavidEppstein wrote: > We have similar behavior being reported again at https://en.wikipedi...
[05:48:09] PROBLEM - snapshot of s3 in codfw on backupmon1001 is CRITICAL: snapshot for s3 at codfw (db2239) taken more than 3 days ago: Most recent backup 2025-03-13 05:30:05 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:04:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639349 (phaultfinder)
[06:30:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:40:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:49:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639352 (phaultfinder)
[07:09:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639353 (phaultfinder)
[08:00:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[08:04:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639380 (phaultfinder)
[08:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:06] SRE, Traffic-Icebox, HTTPS, Wikimedia-Performance-recommendation: Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034#10639396 (TuukkaH) >>! In T238034#5655025, @Vgutierrez wrote: > We should consider QUIC and HTTP/3 adoption carefully as it implies a swi...
[08:29:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639403 (phaultfinder)
[08:42:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:59:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639424 (phaultfinder)
[09:09:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639425 (phaultfinder)
[09:49:44] SRE-swift-storage, Commons, media-backups, MediaWiki-File-management: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996#10639433 (Nemo_bis)
[09:50:48] SRE, SRE-swift-storage, Traffic: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671#10639435 (Nemo_bis)
[10:25:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639439 (phaultfinder)
[10:54:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639482 (phaultfinder)
[11:19:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639486 (phaultfinder)
[11:27:57] FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-7cff657b4f-6pxt7:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[11:59:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639492 (phaultfinder)
[12:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[12:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639496 (phaultfinder)
[12:42:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:34:26] FIRING: InboundMXQueueHigh: MX host mx-in2001:9154 has many queued messages: 1613 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh
[13:35:29] I'll take a look
[13:35:31] !incidents
[13:35:32] 5742 (UNACKED) InboundMXQueueHigh sre (mx-in2001:9154 codfw)
[13:35:36] !ack 5742
[13:35:36] 5742 (ACKED) InboundMXQueueHigh sre (mx-in2001:9154 codfw)
[13:36:01] godog: I'm around, but I only really know exim :(
[13:36:16] The queue started ramping up yesterday around 8 AM
[13:36:31] Sorry, today
[13:36:40] qshape looks to have a huge number <20 minutes old
[13:38:11] unclear from the linked documentation what actions we could take
[13:38:42] postqueue -v says
[13:38:46] postqueue: warning: unix_trigger: write to public/qmgr: Broken pipe
[13:38:46]
[13:39:16] !log restart postfix on mx-in2001
[13:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:35] now postqueue -v -f doesn't emit that warning
[13:40:55] Emperor: what do you reckon is the situation?
[13:41:43] I don't know, maybe the qmgr got wedged? Looking at the dashboard for just mx-in2001, it looks like its topping out at 5 ops/s mail removal, which is less than the incoming rate
[13:41:52] since about 12:23
[13:42:53] ack
[13:43:55] looks like a _lot_ of things are bounces queued up to vrts-bounce@wikimedia.org ?
[13:45:06] having a look at a couple of messages to see if they're similar
[13:45:23] 36k log hits out of 590k according to logstash, is your info from postfix directly?
[13:45:49] I'm poking at postfix, but I've not used postfix before, so take with bucket of salt
[13:47:07] OK, postqueue tells me that of the 2999 queued mails, 2931 are to vrts-bounce@wikimedia.org
[13:48:04] and all 3 of them that I've sampled are bounces to the same gmail address
[13:48:59] P74233 (NDA) is the relevant bit of the bounce
[13:49:35] Yeah, I've got the same proportion from tightening the logstash query
[13:49:37] ok looks like you folks got it, I got to step out
[13:49:50] I think we _might_ have some sort of mail loop?? the message in the bounce is from VRTS
[13:50:12] let me put a whole message into another paste
[13:52:16] https://phabricator.wikimedia.org/P74234 (NDA)
[13:54:30] So AFAICT VRTS is sending a lot of messages of the form "ticket has been created in queue Junk", seemingly in response to a bounce message, to one particular gmail account, which is then rejecting them because that gmail account is getting inundated, and then those bounces are getting queued up to go to vtrs-bounce@wikimedia.org
[13:55:11] and these messages are arriving at a rate higher than we can shift them out to vtrs-bounces
[13:55:48] I'd say drop the messages, maybe should go take a look at mx-out to see if there are a bunch of them in outbound queue as well?
[13:56:39] yeah, mx-out has 20k messages in outbound queue
[13:57:00] claime: OK, do you want to try and can them on mx-out, and I'll look at dropping them from mx-in2001 ?
[13:57:01] well 14k now
[13:57:25] I'm gonna take a look but I'm not well versed in postfix either
[13:57:28] https://wikitech.wikimedia.org/wiki/Postfix at least has a rune for "can everything in the queue for [address]"
[13:57:36] yeah
[13:59:23] !log sudo postqueue -j | jq -r ' select(.recipients[0].address == "vrts-bounce@wikimedia.org") | select(.recipients[1].address == null) | .queue_id' | sudo postsuper -d - # mx-in2001
[13:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:12] that Giant Hammer has cleared the queue on mx-in2001 but presumably mx-out will just fill it up again.
[14:00:39] claime: need any help on mx-out?
[14:04:26] RESOLVED: InboundMXQueueHigh: MX host mx-in2001:9154 has many queued messages: 3374 #page - https://wikitech.wikimedia.org/wiki/Postfix - https://grafana.wikimedia.org/d/h36Havfik/mail-postfix-servers - https://alerts.wikimedia.org/?q=alertname%3DInboundMXQueueHigh
[14:04:42] on it
[14:05:16] dropping all mails from vrts-bounce
[14:05:28] Delete All The Things :D
[14:07:49] postsuper: Deleted: 9802 messages
[14:07:59] score, I only got about 3k :)
[14:08:19] !log sudo postqueue -j | jq -r 'select(.sender == "vrts-bounce@wikimedia.org") | .queue_id' | sudo postsuper -d - # mx-out1001
[14:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:07] (PS1) Anzx: sqwiktionary: update logo, wordmark, tagline and icon [mediawiki-config] - https://gerrit.wikimedia.org/r/1128032 (https://phabricator.wikimedia.org/T342172)
[14:09:58] I'll open a phab task against vrts so the owners can take a look on Monday. Do you think we need to do anything else now?
[14:10:12] Thanks, can you tag J.esse as well?
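(The queue-draining "rune" logged above is `postqueue -j`, which on Postfix 3.1+ emits one JSON object per queued message, filtered through jq and fed to `postsuper -d -`. The filter half can be sketched offline against a made-up sample; the three messages and queue IDs below are invented for illustration, and only the jq expression matches what was actually run on mx-in2001.)

```shell
# Invented sample of `postqueue -j` output: one JSON object per queued message.
cat > /tmp/queue_sample.json <<'EOF'
{"queue_id":"AAA111","sender":"vrts-bounce@wikimedia.org","recipients":[{"address":"someone@example.org"}]}
{"queue_id":"BBB222","sender":"user@example.org","recipients":[{"address":"vrts-bounce@wikimedia.org"}]}
{"queue_id":"CCC333","sender":"user@example.org","recipients":[{"address":"vrts-bounce@wikimedia.org"},{"address":"other@example.org"}]}
EOF

# The mx-in filter: keep only single-recipient mail addressed to vrts-bounce@.
# The second select() drops multi-recipient messages (.recipients[1] is null
# when the array has one element, so the comparison holds only for those).
jq -r 'select(.recipients[0].address == "vrts-bounce@wikimedia.org")
       | select(.recipients[1].address == null)
       | .queue_id' < /tmp/queue_sample.json
# prints only BBB222; on a real MX the IDs would go to: sudo postsuper -d -
```

The mx-out variant swaps the recipient test for `select(.sender == "vrts-bounce@wikimedia.org")`, since there the problem mail was outbound bounces *from* that address.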
[14:10:51] I think we're probably good
[14:11:13] sobanski: ack
[14:19:31] T389004 opened, which I think is the headlines.
[14:19:32] T389004: VRTS bounces filled mail queues, resulting in a weekend page - https://phabricator.wikimedia.org/T389004
[14:24:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639704 (phaultfinder)
[14:37:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:38] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:27:57] FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-7cff657b4f-6pxt7:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[16:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:31:01] (PS1) Majavah: P:wmcs: wikireplicas: Fix fr_actor not being exposed [puppet] - https://gerrit.wikimedia.org/r/1128041 (https://phabricator.wikimedia.org/T383491)
[18:29:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10639802 (phaultfinder)
[18:42:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:27:57] FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-7cff657b4f-6pxt7:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[20:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[20:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:00:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:05:25] FIRING: [11x] SystemdUnitFailed: confd_prometheus_metrics.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:09:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640035 (phaultfinder)
[21:44:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640042 (phaultfinder)
[22:42:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:49:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640102 (phaultfinder)
[23:08:30] (PS1) Jforrester: search-redirect: Handle $_GET potential vulnerability scanning [mediawiki-config] - https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019)
[23:10:05] (CR) CI reject: [V:-1] search-redirect: Handle $_GET potential vulnerability scanning [mediawiki-config] - https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: Jforrester)
[23:12:35] (PS2) Jforrester: search-redirect: Handle $_GET potential vulnerability scanning [mediawiki-config] - https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019)
[23:16:45] (CR) Jforrester: [C:+1] P:wmcs: wikireplicas: Fix fr_actor not being exposed [puppet] - https://gerrit.wikimedia.org/r/1128041 (https://phabricator.wikimedia.org/T383491) (owner: Majavah)
[23:26:01] (CR) Ladsgroup: "Thanks! I will deploy this first thing tomorrow." [puppet] - https://gerrit.wikimedia.org/r/1128041 (https://phabricator.wikimedia.org/T383491) (owner: Majavah)
[23:27:57] FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-7cff657b4f-6pxt7:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[23:54:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10640153 (phaultfinder)