[00:05:51] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[00:05:57] (PS6) Dzahn: git::clone: allow gitlab as a source [puppet] - https://gerrit.wikimedia.org/r/766851
[00:06:38] (PS2) Dzahn: add 15.wikipedia.org to cert for miscweb behind istio ingress [deployment-charts] - https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171)
[00:10:03] (PS3) Dzahn: add 15.wikipedia to cert and gateway hosts for miscweb behind istio ingress [deployment-charts] - https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171)
[00:13:51] (CR) Dzahn: "[deploy1002:~] $ curl --resolve "15.wikipedia.org:4111:staging.svc.eqiad.wmnet" 'https://15.wikipedia.org'" [deployment-charts] - https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[00:16:00] (CR) Dzahn: "curl -s --resolve "15.wikipedia.org:4111:staging.svc.eqiad.wmnet" 'https://15.wikipedia.org' | grep grandpa => ";Wikipedia is like an all" [deployment-charts] - https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[00:17:44] !log 15.wikipedia.org on k8s (staging) deploy1002:~] $ curl -s --resolve "15.wikipedia.org:4111:staging.svc.eqiad.wmnet" 'https://15.wikipedia.org' | grep grandpa => "“Wikipedia is like an all-knowing grandpa.”" | T300171
[00:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:53] T300171: move micro sites from ganeti to kubernetes - https://phabricator.wikimedia.org/T300171
[00:19:57] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.433e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[00:32:29] (PS1) STran: Add IPInfo viewing rights for certain groups [mediawiki-config] - https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499)
[00:36:03] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:31] (CR) STran: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: STran)
[00:40:21] (CR) STran: "FAILURE No change detected against the current configuration. in 37s (non-voting)" [mediawiki-config] - https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: STran)
[00:44:35] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:47] (Processor usage over 85%) firing: (2) Alert for device scs-eqsin.mgmt.eqsin.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[01:13:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[01:14:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[01:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21614 and previous config saved to /var/cache/conftool/dbconfig/20220301-011404-ladsgroup.json
[01:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:08] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[01:25:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:37:32] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:40:37] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:48:11] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220301T0200)
[02:01:41] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.495e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[02:07:18] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.24 [core] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/766891
[02:07:20] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.24 [core] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/766891 (owner: TrainBranchBot)
[02:14:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21615 and previous config saved to /var/cache/conftool/dbconfig/20220301-021424-ladsgroup.json
[02:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:30] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[02:22:57] (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.24 [core] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/766891 (owner: TrainBranchBot)
[02:29:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21616 and previous config saved to /var/cache/conftool/dbconfig/20220301-022928-ladsgroup.json
[02:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:26] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3
others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Dylsss)
[02:44:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P21617 and previous config saved to /var/cache/conftool/dbconfig/20220301-024433-ladsgroup.json
[02:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T302185)', diff saved to https://phabricator.wikimedia.org/P21618 and previous config saved to /var/cache/conftool/dbconfig/20220301-025938-ladsgroup.json
[02:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:42] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[03:30:03] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[03:44:09] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.556e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[04:00:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[04:11:37] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016
[04:14:29] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.574e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[04:17:35] RECOVERY - Disk space on deneb is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[04:25:19] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.581e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016
[04:52:47] (Processor usage over 85%) firing: (2) Alert for device scs-eqsin.mgmt.eqsin.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[05:22:35] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[05:36:19] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.623e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[05:40:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:44:13] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016
[05:57:51] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.636e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016
[06:33:39] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[06:46:36] <_joe_> !log uploaded scap 4.4.1 to {stretch,buster,bullseye}
[06:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:49] <_joe_> !log uploaded scap 4.4.1 to {stretch,buster,bullseye} T302464
[06:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:52] T302464: Deploy Scap version 4.4.1 - https://phabricator.wikimedia.org/T302464
[06:47:49] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.666e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[06:56:45] !log restart purged on cp6001 to clear stale kafka TLS consumer state (or attempting to)
[06:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:18] seems working
[06:57:43] !log oblivian@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): T302464 test
[06:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:47] T302464: Deploy Scap version 4.4.1 - https://phabricator.wikimedia.org/T302464
[06:57:58] ah nope
[06:58:00] !log oblivian@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): T302464 test (duration: 00m 17s)
[06:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:28] nevermind, it works, I see messages now being processed on cp6001
[07:00:22] vgutierrez: o/ purged on cp6001 is happily processing kafka msgs, but of course the backlog is a lot.. I just realized that we may not want to do it and just reset the kafka status on the cp nodes (if possible), so I am not going to restart the other purged for the moment, lemme know what you think about it
[07:02:25] yep..
drmrs has been offline again and purge requests have piled up
[07:04:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[07:15:43] vgutierrez: I can take care of the restarts (staggered) if you want, cp6001 has almost recovered
[07:15:46] https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&var-datasource=drmrs%20prometheus%2Fops&var-instance=cp6001&from=now-3h&to=now
[07:16:06] maybe better to not restart all at once to avoid some pressure on kafka main (I doubt it but better be safe)
[07:17:20] (I tailed cp6001's logs and I don't see the tls errors anymore)
[07:19:03] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.685e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[07:26:09] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6001 is OK: (C)5000 gt (W)3000 gt 591.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6001
[07:49:36] (CR) Filippo Giunchedi: [C: +1] "LGTM, thanks for following up" [software/spicerack] - https://gerrit.wikimedia.org/r/766813 (owner: Volans)
[07:51:16] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/766871 (https://phabricator.wikimedia.org/T302687) (owner: JHathaway)
[07:59:16] !log restart purged on cp6002
[07:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:05] Amir1, awight, Urbanecm, and taavi: Dear deployers, time to do the UTC morning backport window deploy. Dont look at me like that. You signed up for it.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220301T0800).
[08:00:05] Jayme: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:30] Good morning!
[08:00:45] Cc: jayme --^
[08:01:28] morning!
[08:01:48] i can deploy, or jayme can self-service?
[08:02:31] it should be safe to deploy, Joe +1ed it, I have no idea atm how to quickly check afterwards that all is working
[08:03:13] we can probably check traffic in https://grafana.wikimedia.org/d/CI6JRnLMz/linkrecommendation?orgId=1
[08:03:33] you mean the usage of the link recommendation? I can help with that (I'm from Growth). It's not used for anything critical, too
[08:03:42] super :)
[08:03:47] let's do it then
[08:04:14] w/o jayme around? if you say so :). elukey: want to do the deploy, or should I?
[08:04:35] urbanecm: please go ahead, I am very rusty with mw deployments
[08:04:58] (CR) Urbanecm: [C: +2] Use service-proxy to connect to linkrecommendation [mediawiki-config] - https://gerrit.wikimedia.org/r/766780 (https://phabricator.wikimedia.org/T302719) (owner: JMeybohm)
[08:04:59] okay :)
[08:05:06] if anything goes bad we can blame Janis
[08:05:13] :D
[08:05:15] win/win
[08:05:15] :D
[08:05:51] sorry, I'm a bit late urbanecm
[08:06:06] no problem jayme :).
elukey already told me to start :D
[08:06:09] (Merged) jenkins-bot: Use service-proxy to connect to linkrecommendation [mediawiki-config] - https://gerrit.wikimedia.org/r/766780 (https://phabricator.wikimedia.org/T302719) (owner: JMeybohm)
[08:06:53] I'm going to sync it w/o testing at a debug server, as this service is only used from the mwmaint server
[08:08:26] (PS1) Muehlenhoff: Remova Ema from Icinga permissions [puppet] - https://gerrit.wikimedia.org/r/767046
[08:08:29] !log urbanecm@deploy1002 Synchronized wmf-config/ProductionServices.php: d149208dfd7e5fbf51f44dd0bf7dae3b2e2f5159: Use service-proxy to connect to linkrecommendation (T302719) (duration: 00m 49s)
[08:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:33] T302719: MediaWiki should use service-proxy to connect to Add Link / Linkrecommendation - https://phabricator.wikimedia.org/T302719
[08:08:49] (PS2) Muehlenhoff: Remova Ema from Icinga permissions [puppet] - https://gerrit.wikimedia.org/r/767046
[08:08:56] moritzm: :(
[08:10:10] does anyone know whether maintenance jobs at mwmaint need to be restarted to pick their new config? Wondering whether we need to restart `mediawiki_job_growthexperiments-refreshLinkRecommendations-s*` to make the config take effect.
[08:10:48] yes
[08:11:20] well, if it's a forwikiindblist type script, then whenever it starts the next wiki it'll use the new config
[08:12:09] I'd prefer restarting it immediately -- it can run on a single wiki for a long time.
[08:13:13] but I'm not sure how to actually restart it. Since it runs at www-data, I can just kill the processes, but...surely there's a better way
[08:13:20] *as
[08:13:29] ask a root to `systemctl restart [name]`
[08:14:50] (not me, I'm not actually here and my yubikey is on the other side of the room :p)
[08:14:59] :D
[08:15:24] elukey: jayme: can one of you do it?
for `mediawiki_job_growthexperiments-refreshLinkRecommendations-sX` (X is from 1 to 8)
[08:15:53] urbanecm: sure
[08:16:22] (CR) Muehlenhoff: [C: +2] Remova Ema from Icinga permissions [puppet] - https://gerrit.wikimedia.org/r/767046 (owner: Muehlenhoff)
[08:16:59] !log drain instances off ganeti2008 for eventual decom
[08:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:24] urbanecm: done
[08:18:29] thanks
[08:18:59] assuming that's only needed on mwmaint100 as active mwmaint host
[08:19:29] yes, no action on the passive host needed
[08:19:56] the logs at `/var/log/mediawiki/mediawiki_job_growthexperiments-refreshLinkRecommendations-s2/syslog.log` indeed show it restarted and seems still running
[08:20:05] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/766851 (owner: Dzahn)
[08:20:55] I do see traffic rising in service-proxy
[08:21:06] even better :)
[08:21:12] so it looks we're done?
[08:21:52] looks good to me if it's only a mwmaint thing (I wasn't aware)
[08:22:41] ftr: https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=misc&var-origin_instance=All&var-destination=linkrecommendation
[08:23:17] yeah, it should be (api.wikimedia.org also exposes it for external usage, but I don't know if that goes through MW or not)
[08:25:30] !log restart purged on cp6003
[08:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:38] urbanecm: no, that's from api-gateway directly and not affected by this change
[08:26:40] thanks!
[08:26:59] okay, great.
Happy to help :)
[08:27:01] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[08:27:08] !log UTC morning B&C window done
[08:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:21] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6002 is OK: (C)5000 gt (W)3000 gt 1283 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6002
[08:29:19] urbanecm: is it generally considered "safe" to restart those jobs? I'll potentially have to do it again the upcoming days when moving linkrecommendation to ingress
[08:29:44] jayme: yes, they can be restarted at any time.
[08:30:35] ok, cool
[08:33:29] SRE, DC-Ops, Infrastructure-Foundations, SRE Observability, netops: SCS CPU monitoring issue - https://phabricator.wikimedia.org/T285229 (ayounsi) This regularly alerts and is not actionable as it's a monitoring glitch. The CPU usage on the device is for example: `Cpu(s): 0.3%us, 0.0%sy, 0...
[08:34:22] jayme: :oo is ingress ready to use now?
[08:35:56] legoktm: yes. static-bugzilla is using it atm but I'd like to gain some more experience with some more/consistent traffic before taking on shellbox
[08:36:08] :)) gotcha
[08:36:10] (CR) Volans: [V: +2 C: +2] "Overriding jenkins as the failure is due to upstream prospector and there is another fix for that."
[software/spicerack] - https://gerrit.wikimedia.org/r/766813 (owner: Volans)
[08:42:03] (PS1) Muehlenhoff: Remove access for amuigai [puppet] - https://gerrit.wikimedia.org/r/767050
[08:42:19] (CR) jerkins-bot: [V: -1] bandit: ignore hardcoded password in tests [software/spicerack] - https://gerrit.wikimedia.org/r/766813 (owner: Volans)
[08:42:47] (Processor usage over 85%) firing: (2) Alert for device scs-eqsin.mgmt.eqsin.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[08:47:47] (Processor usage over 85%) resolved: Alert for device scs-ulsfo.mgmt.ulsfo.wmnet - Processor usage over 85% - https://alerts.wikimedia.org
[08:51:32] (PS1) Vgutierrez: site: Reimage cp2039 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/767051 (https://phabricator.wikimedia.org/T290005)
[08:53:02] (CR) Vgutierrez: [C: +2] site: Reimage cp2039 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/767051 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[08:54:03] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2039.codfw.wmnet with OS buster
[08:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:15] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2039.codfw.wmnet with OS buster
[08:54:59] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6003 is OK: (C)5000 gt (W)3000 gt 1079 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6003
[08:57:01] !log restart purged on cp6004
[08:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:37] (JobUnavailable)
firing: (2) Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:02:11] ^^ expected
[09:04:02] vgutierrez: FYI with jobunavailable alert on alertmanager we can silence/ack individual job/sites (wasn't possible in icinga)
[09:04:17] if you are expecting more that is
[09:04:32] (CR) Muehlenhoff: [C: +2] Remove access for amuigai [puppet] - https://gerrit.wikimedia.org/r/767050 (owner: Muehlenhoff)
[09:04:34] oh awesome
[09:06:02] !log restart purged on cp6005
[09:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:04] (CR) Vgutierrez: [C: +2] haproxy::tls_terminator: Log Host header [puppet] - https://gerrit.wikimedia.org/r/766770 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:08:14] (CR) Vgutierrez: [C: +2] mtail::cache_haproxy: Provide haproxy_client_healthcheck_ttfb histogram [puppet] - https://gerrit.wikimedia.org/r/766771 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:08:43] SRE, DC-Ops, Infrastructure-Foundations, SRE Observability, netops: SCS CPU monitoring issue - https://phabricator.wikimedia.org/T285229 (fgiunchedi) Agreed the librenms patch is the way to go, I won't have the bandwidth any time soon but happy to assist
[09:09:52] (CR) Filippo Giunchedi: "LGTM, though please test in pontoon o11y stack if you haven't already" [puppet] - https://gerrit.wikimedia.org/r/766814 (https://phabricator.wikimedia.org/T292175) (owner: Herron)
[09:10:37] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:12:20] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for
2:00:00 on cp2039.codfw.wmnet with reason: host reimage
[09:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:28] (CR) Giuseppe Lavagetto: [C: +2] api: remove http endpoint from pybal [puppet] - https://gerrit.wikimedia.org/r/766573 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[09:14:44] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2039.codfw.wmnet with reason: host reimage
[09:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:37] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:16:12] (PS1) Muehlenhoff: Remove access for zpapierski [puppet] - https://gerrit.wikimedia.org/r/767054
[09:17:49] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[09:20:33] <_joe_> !log restarted pybal on lvs2010
[09:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:37] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:20:45] <_joe_> uh
[09:20:58] <_joe_> vgutierrez: that I get is your reimage ongoing, right?
[09:22:14] <_joe_> !log restarted pybal on lvs2009, the mw api is now effectively https-only in codfw T287820
[09:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:18] T287820: Pybal doing HTTP requests resulting in a lot of log entries - https://phabricator.wikimedia.org/T287820
[09:22:43] (PS2) David Caro: wmcs-cinder-backups: Increase timeout and decrease frequency [puppet] - https://gerrit.wikimedia.org/r/766816 (https://phabricator.wikimedia.org/T302720)
[09:22:54] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging ZPapierski out of all services on: 1881 hosts
[09:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging ZPapierski out of all services on: 1881 hosts
[09:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:47] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Amuigai out of all services on: 1881 hosts
[09:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:54] (CR) David Caro: [C: +1] P:wmcs::prometheus: deploy alert rule from ops/alerts.git (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[09:25:01] <_joe_> !log manually removed ipvs entries on lvs2*, so it is actually now that the http api is not available in codfw anymore
[09:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:10] (CR) JMeybohm: [C: +1] kubernetes: Upgrade default envoy version to 1.15.5 [puppet] - https://gerrit.wikimedia.org/r/766840 (https://phabricator.wikimedia.org/T300324) (owner: RLazarus)
[09:25:37] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:25:41] !log restart varnishkafka-webrequest on cp6009 as attempt to clear a weird status of librdkafka (delivery errors to kafka)
[09:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Amuigai out of all services on: 1881 hosts
[09:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:23] (CR) David Caro: wmcs-cinder-backups: Increase timeout and decrease frequency (1 comment) [puppet] - https://gerrit.wikimedia.org/r/766816 (https://phabricator.wikimedia.org/T302720) (owner: David Caro)
[09:27:20] (PS4) Majavah: P:wmcs::prometheus: deploy alert rule from ops/alerts.git [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493)
[09:27:23] (CR) Filippo Giunchedi: "Idea LGTM, likely needs testing in pontoon o11y stack at least to validate things on a basic level" [puppet] - https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[09:27:32] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:27:41] (CR) Majavah: P:wmcs::prometheus: deploy alert rule from ops/alerts.git (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[09:27:54] (CR) Muehlenhoff: [C: +2] Remove access for zpapierski [puppet] - https://gerrit.wikimedia.org/r/767054 (owner: Muehlenhoff)
[09:28:01] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6004 is OK: (C)5000 gt (W)3000 gt 1063 https://wikitech.wikimedia.org/wiki/Purged%23Alerts
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6004
[09:29:27] (PS1) Vgutierrez: cache::haproxy: Add captured request headers to log-format [puppet] - https://gerrit.wikimedia.org/r/767055 (https://phabricator.wikimedia.org/T290005)
[09:30:14] (CR) Filippo Giunchedi: [C: +1] standard_packages: install freeipmi-ipmiseld on metal by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) (owner: Herron)
[09:31:17] <_joe_> !log restart pybal on lvs1020
[09:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:33] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.765e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[09:32:17] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34012/console" [puppet] - https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: Majavah)
[09:33:11] (CR) Gehel: "LGTM, but note that this probably affects ELK as well."
[puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [09:33:35] <_joe_> !log restarted pybal on lvs1019, removed the mw api from ipvsadm, the mw api is internally fully encrypted [09:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:31] <_joe_> Amir1: the deprecated calls to the api from pybal should be over [09:35:24] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34013/console" [puppet] - 10https://gerrit.wikimedia.org/r/766816 (https://phabricator.wikimedia.org/T302720) (owner: 10David Caro) [09:36:28] _joe_: thanks. I hunt down more stuff to he removed from deprecated log firehose [09:36:33] (03CR) 10David Caro: [V: 03+1] "> Patch Set 2: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/766816 (https://phabricator.wikimedia.org/T302720) (owner: 10David Caro) [09:36:38] *be [09:36:39] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Add captured request headers to log-format [puppet] - 10https://gerrit.wikimedia.org/r/767055 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:37:19] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6005 is OK: (C)5000 gt (W)3000 gt 966 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6005 [09:38:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: use tls for horizon->api connections [puppet] - 10https://gerrit.wikimedia.org/r/766281 (owner: 10Majavah) [09:45:29] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2039.codfw.wmnet with OS buster [09:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:42] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real 
traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2039.codfw.wmnet with OS buster c... [09:48:31] (03CR) 10Ayounsi: [C: 03+2] drmrs: Add GTT links to OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/766857 (owner: 10Ayounsi) [09:48:49] !log elukey@stat1004:~$ sudo kill `pgrep -u zpapierski` (offboarded user, puppet broken on the host) [09:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:55] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34017/console" [puppet] - 10https://gerrit.wikimedia.org/r/766816 (https://phabricator.wikimedia.org/T302720) (owner: 10David Caro) [09:54:30] (03CR) 10Arturo Borrero Gonzalez: "I like this idea, thanks!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766765 (owner: 10David Caro) [09:55:00] (03PS2) 10Giuseppe Lavagetto: api: remove non-https endpoint from backends [puppet] - 10https://gerrit.wikimedia.org/r/766574 (https://phabricator.wikimedia.org/T244843) [09:55:20] (03CR) 10Arturo Borrero Gonzalez: "I like this idea, thanks!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766766 (owner: 10David Caro) [09:59:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "this LGTM, but I'd like others to review this idea. Perhaps @bryan or @Andrew would have an opinion on this." 
[puppet] - 10https://gerrit.wikimedia.org/r/766292 (owner: 10Majavah) [10:00:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: redirect legacy ru_monuments to ru-monuments [puppet] - 10https://gerrit.wikimedia.org/r/762900 (https://phabricator.wikimedia.org/T301720) (owner: 10BryanDavis) [10:00:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] api: remove non-https endpoint from backends [puppet] - 10https://gerrit.wikimedia.org/r/766574 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [10:03:07] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10ayounsi) Ping? [10:05:02] (03PS8) 10Filippo Giunchedi: Introduce 'alertmanager' and 'alerting' modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) [10:05:44] !log pool cp2039 running HAProxy as TLS termination layer - T290005 T271421 [10:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:49] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:05:49] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [10:10:58] (03CR) 10jerkins-bot: [V: 04-1] Introduce 'alertmanager' and 'alerting' modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [10:12:53] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [10:15:28] (03CR) 10David Caro: [V: 03+1 C: 03+2] wmcs-cinder-backups: Increase timeout and decrease frequency [puppet] - 10https://gerrit.wikimedia.org/r/766816 (https://phabricator.wikimedia.org/T302720) (owner: 10David Caro) [10:20:43] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:34] (03CR) 10Filippo Giunchedi: [C: 03+1] P:wmcs::prometheus: deploy alert rule from ops/alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [10:24:09] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:28] (03PS3) 10David Caro: wmcs.toolforge.grid.get_cluster_status: improve yaml output [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766765 (https://phabricator.wikimedia.org/T302702) [10:26:30] (03PS3) 10David Caro: wmcs.toolforg.grid.get_cluster_status: allow filtering the ok ones [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766766 (https://phabricator.wikimedia.org/T302702) [10:27:01] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.797e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [10:27:08] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline (config file is missing .conf), also what Gehel said. 
Please also attach a PCC run" [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [10:28:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:15] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:41] (03PS1) 10Muehlenhoff: Require Python 3.7/buster for logout scripts [puppet] - 10https://gerrit.wikimedia.org/r/767064 [10:31:25] !log restart purged on cp600[6-8] [10:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:13] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10akosiaris) [10:32:29] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [10:35:23] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6006 is OK: (C)5000 gt (W)3000 gt 1368 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6006 [10:36:16] (03PS1) 10Vgutierrez: site: Reimage cp3062 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/767065 (https://phabricator.wikimedia.org/T290005) [10:39:11] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Ema out of all services on: 1353 hosts [10:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:45] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp3062 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/767065 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:40:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Ema out of all services on: 1353 hosts [10:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:31] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Ema out of all services on: 259 hosts [10:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Ema out of all services on: 259 hosts [10:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:46] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp3062.esams.wmnet with OS buster [10:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:58] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - 
https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3062.esams.wmnet with OS buster [10:42:59] (03PS2) 10Giuseppe Lavagetto: appservers: remove monitoring for http-only [puppet] - 10https://gerrit.wikimedia.org/r/766575 (https://phabricator.wikimedia.org/T244843) [10:45:00] (03PS1) 10Muehlenhoff: Remove access for ema [puppet] - 10https://gerrit.wikimedia.org/r/767067 [10:45:57] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.809e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [10:46:13] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6007 is OK: (C)5000 gt (W)3000 gt 704.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6007 [10:47:20] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED [10:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:32] (JobUnavailable) firing: (3) Reduced availability for job cache_envoy in esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:48:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] appservers: remove monitoring for http-only [puppet] - 10https://gerrit.wikimedia.org/r/766575 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [10:50:37] (JobUnavailable) firing: (3) Reduced availability for job cache_envoy in esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org [10:52:22] (03PS1) 10Vgutierrez: mtail::atstls: Use native histogram type [puppet] - 10https://gerrit.wikimedia.org/r/767069 [10:52:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6008 is OK: (C)5000 gt (W)3000 gt 405.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6008 [10:53:40] (03CR) 10jerkins-bot: [V: 04-1] mtail::atstls: Use native histogram type [puppet] - 10https://gerrit.wikimedia.org/r/767069 (owner: 10Vgutierrez) [10:53:57] (03CR) 10Volans: "LGTM, one question inline" [puppet] - 10https://gerrit.wikimedia.org/r/767064 (owner: 10Muehlenhoff) [10:55:04] (03PS2) 10Vgutierrez: mtail::atstls: Use native histogram type [puppet] - 10https://gerrit.wikimedia.org/r/767069 [10:56:21] (03CR) 10jerkins-bot: [V: 04-1] mtail::atstls: Use native histogram type [puppet] - 10https://gerrit.wikimedia.org/r/767069 (owner: 10Vgutierrez) [10:56:51] (03CR) 10JMeybohm: [C: 03+1] miscweb: Restore envoy image_version to the inherited default [deployment-charts] - 10https://gerrit.wikimedia.org/r/766842 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [10:59:39] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [11:00:24] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1148.mgmt.eqiad.wmnet with reboot policy FORCED [11:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:08] (03PS2) 10Giuseppe Lavagetto: appserver: remove unencrypted LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/766576 (https://phabricator.wikimedia.org/T244843) [11:02:01] !log restart purged on cp60[09,10,11] [11:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:12] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10akosiaris) >>! In T302423#7733067, @jbond wrote: > @jhathaway thanks for writing this up just a few quick comments. > > In general i think that the foundation has always been [[ https://wikit... [11:02:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] appserver: remove unencrypted LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/766576 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:03:49] (03PS1) 10Hnowlan: api-gateway: move route_name metadata to route level [deployment-charts] - 10https://gerrit.wikimedia.org/r/767070 (https://phabricator.wikimedia.org/T295956) [11:07:01] (03CR) 10JMeybohm: "While the change is technically correct, I think I'm missing something." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [11:07:48] <_joe_> !log restarted pybal on lvs2010, T244843 [11:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:51] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [11:09:00] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3062.esams.wmnet with reason: host reimage [11:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:41] (03PS1) 10Volans: redfish: DellSCP, allow creation of new entities [software/spicerack] - 10https://gerrit.wikimedia.org/r/767071 [11:10:00] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for ema [puppet] - 10https://gerrit.wikimedia.org/r/767067 (owner: 10Muehlenhoff) [11:11:21] <_joe_> !log restarted pybal on lvs2009, T244843 [11:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:34] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3062.esams.wmnet with reason: host reimage [11:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:07] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.825e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [11:13:23] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [11:14:17] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.1:80]) https://wikitech.wikimedia.org/wiki/PyBal [11:15:33] (03CR) 10jerkins-bot: [V: 04-1] redfish: DellSCP, allow creation of new entities [software/spicerack] - 10https://gerrit.wikimedia.org/r/767071 (owner: 10Volans) [11:16:07] <_joe_> uhhh I did remove it [11:16:09] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6009 [11:17:17] <_joe_> !log restarting pybal on lvs1020 T244843 [11:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:22] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [11:17:26] !log rolled back linkrecommendation staging helm release to revision 12 - T302744 [11:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:32] T302744: Improved alerts/awareness if helm deployment of a service fails - https://phabricator.wikimedia.org/T302744 [11:18:21] <_joe_> !log also removed the ipvsadm entry for apaches:80 T244843 [11:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:09] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:21:46] <_joe_> !log restarted pybal, removed ipvsadm entry on lvs1019. 
Now all of MediaWiki has no http LVS endpoint available.T244843 [11:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:10] (03PS1) 10Volans: sre.hosts.provision: retry once on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/767073 [11:22:12] PROBLEM - LVS apaches codfw port 80/tcp - Main MediaWiki application server cluster- appservers.svc.codfw.wmnet IPv4 #page on appservers.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:22:12] (03PS1) 10Volans: sre.hosts.provision: always set the BiosBootSeq [cookbooks] - 10https://gerrit.wikimedia.org/r/767074 [11:22:26] <_joe_> wat [11:22:35] <_joe_> I just disabled all the alerts earlier [11:22:39] * volans here [11:22:41] :) [11:22:43] false alarm? [11:22:44] here [11:22:45] <_joe_> it's not a real alert [11:22:48] * Emperor here [11:22:50] <_joe_> as in [11:22:59] you remove dport 80 right? [11:23:00] <_joe_> I don't know why it wasn't removed from icinga [11:23:04] <_joe_> yes [11:23:07] checking [11:23:38] <_joe_> Notice: /Stage[main]/Icinga/Nagios_service[alert1001 appservers.svc.codfw.wmnet_apaches]/ensure: removed [11:23:40] <_joe_> Info: Computing checksum on file /etc/nagios/nagios_service.cfg [11:23:42] <_joe_> Notice: /Stage[main]/Icinga/Nagios_service[alert1001 appservers.svc.eqiad.wmnet_apaches]/ensure: removed [11:23:49] <_joe_> from the puppet run on alert1001 [11:23:57] <_joe_> sorry everyone :( [11:24:06] <_joe_> I did everything by the book, in theory [11:24:07] ah, probably just the usual async thingy [11:24:18] <_joe_> jynus: what async thingy? 
[11:24:47] even if doing it well, I am never 100% sure nagios does the right thing at the right time [11:24:59] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.provision: retry once on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/767073 (owner: 10Volans) [11:25:00] Here but on phone [11:25:01] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.provision: always set the BiosBootSeq [cookbooks] - 10https://gerrit.wikimedia.org/r/767074 (owner: 10Volans) [11:25:10] <_joe_> Amir1: go away, not a real alert [11:25:11] e.g. removed alerts still firing, etc. or started firing before removal and they still fire [11:25:20] Haha. Nice [11:25:39] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6011 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [11:25:49] <_joe_> can someone ack the alert on victorops? I am having problems logging in [11:26:04] I will [11:26:05] <_joe_> which are clearly a case of fat fingers on a phone keyboard [11:26:24] {done} [11:26:25] <_joe_> volans: did you find anything on alert1001? [11:26:32] <_joe_> what went wrong? [11:26:34] _joe_: no, it's not in /etc/ [11:26:39] Resolved it [11:26:49] nor icinga nor nagios config files [11:26:59] <_joe_> volans: maybe icinga didn't reload? 
[11:27:13] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1148.mgmt.eqiad.wmnet with reboot policy FORCED [11:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:20] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [11:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:44] 10SRE, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) 05Open→03Resolved [11:28:11] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [11:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:18] PROBLEM - LVS apaches eqiad port 80/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet IPv4 #page on appservers.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:28:29] <_joe_> sigh [11:28:33] Lol [11:28:36] <_joe_> ofc it's the same thing [11:28:37] _joe_: yes I don't see the reload before the applied catalog [11:28:49] <_joe_> volans: ok that is... strange [11:28:53] <_joe_> but let's reload now :) [11:28:57] <_joe_> should I do it? 
[11:29:44] This thing didn't die without a fight [11:29:59] <_joe_> I see [11:30:01] <_joe_> Process: 18531 ExecReload=/etc/init.d/icinga reload (code=exited, status=6) [11:30:13] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [11:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:29] "Icinga configuration contains errors" [11:30:39] <_joe_> yeah I think that's the case [11:31:02] maybe it is outdated, but fired 3h ago [11:31:37] 10SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (10faidon) >>! In T302617#7738776, @SCherukuwada wrote: > You're absolutely right to be concerned about traffic from search engines. That said, I'm familiar enough with how this works to be comfortable o... [11:31:39] <_joe_> sigh [11:31:41] <_joe_> Mar 01 11:30:53 alert1001 icinga[19659]: Error: Could not find any contact matching 'ema' (config file '/etc/icinga/objects/contactgroups.cfg', starting on line 42) [11:31:52] * Emperor kicked the next victorops alert [11:31:59] <_joe_> moritzm: ^^ [11:32:00] I think there was some related root spam this morning [11:32:13] (03PS1) 10JMeybohm: Make k8s-ingress-wikikube page [puppet] - 10https://gerrit.wikimedia.org/r/767078 (https://phabricator.wikimedia.org/T290966) [11:32:14] But that's probably just a mailing list update [11:32:15] <_joe_> that would make sense [11:32:29] PROBLEM - Number of messages locally queued by purged for processing on cp6011 is CRITICAL: cluster=cache_text instance=cp6011 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [11:32:40] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [11:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:04] agree, let me 
remove it from puppet private [11:33:22] <_joe_> yeah that's where it is [11:33:23] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [11:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:28] <_joe_> :( [11:34:13] modules/nagios_common/files/contactgroups.cfg has 'ema' in a bunch of times. Should I CR with them removed? [11:34:18] in regular puppet [11:34:27] <_joe_> Emperor: yes [11:34:32] doing so [11:34:45] (03PS1) 10Volans: icinga: remove ema from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/767079 [11:34:46] ^^^^ [11:34:54] <_joe_> see why no one should ever leave? [11:34:56] <-- too slow, as ever [11:35:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] icinga: remove ema from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/767079 (owner: 10Volans) [11:35:47] (03CR) 10Muehlenhoff: [C: 03+1] icinga: remove ema from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/767079 (owner: 10Volans) [11:35:53] !log klausman@cumin2002 START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2001.codfw.wmnet [11:35:54] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [11:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:56] <_joe_> Emperor: don't despair, one day you'll be faster [11:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:04] (03CR) 10MVernon: [C: 03+1] "I was about to do similar myself :)" [puppet] - 10https://gerrit.wikimedia.org/r/767079 (owner: 10Volans) [11:36:06] (03CR) 10Volans: [C: 03+2] icinga: remove ema from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/767079 (owner: 10Volans) [11:36:11] <_joe_> I felt like that for the first few months here during incidents [11:36:21] sorry for that, ema's offboarding is still WIP and I had no idea that removal process in Icinga was so brittle... 
[11:36:28] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [11:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:33] <_joe_> always playing catch-up to them greeks [11:36:44] * volans running puppet on alert1001 [11:36:49] <_joe_> volans: thanks [11:37:11] <_joe_> and again sorry everyone for this, I was used to icinga erroring out when reloads failed [11:37:42] <_joe_> I wouldn't have proceeded otherwise [11:38:02] <_joe_> and btw, the issue was introduced after I made the api change, where I also went to check the icinga UI [11:38:03] RECOVERY - Number of messages locally queued by purged for processing on cp6011 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [11:38:13] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6009 is CRITICAL: 1.386e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6009 [11:38:32] (03PS1) 10Hnowlan: changeprop: add sampling configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/767080 (https://phabricator.wikimedia.org/T300914) [11:39:48] icinga reloaded and config correct [11:40:00] Mar 1 11:39:26 alert1001 icinga: Icinga 1.14.2 starting... (PID=6586) [11:42:48] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:43:10] <_joe_> volans: wikilove [11:43:54] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:16] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6009 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6009 [11:45:12] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [11:52:27] (03PS1) 10Muehlenhoff: Remove ema from router config [homer/public] - 10https://gerrit.wikimedia.org/r/767083 [11:53:41] (03PS3) 10Vgutierrez: mtail::atstls: Use native histogram type [puppet] - 10https://gerrit.wikimedia.org/r/767069 [11:55:47] 10SRE, 10SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (10jcrespo) Adding @JMeybohm who is (I believe) on clinic duty this week. Normally I don't do this but we are particularly interested on getting admin access for someone on the Web te... [12:05:26] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [12:06:38] PROBLEM - traffic_server backend process restarted on cp6011 is CRITICAL: 11 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6011&var-layer=backend [12:09:32] PROBLEM - traffic_server tls process restarted on cp6011 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6011&var-layer=tls [12:09:46] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@51d5a07] (eqiad): Fix pool size configuration [12:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:56] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:10:26] (03PS2) 10Volans: sre.hosts.provision: retry once on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/767073 [12:10:28] (03PS2) 10Volans: 
sre.hosts.provision: always set the BiosBootSeq [cookbooks] - 10https://gerrit.wikimedia.org/r/767074 [12:11:33] (03CR) 10David Caro: [C: 03+2] wmcs.toolforge.grid.get_cluster_status: improve yaml output [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766765 (https://phabricator.wikimedia.org/T302702) (owner: 10David Caro) [12:11:36] (03CR) 10David Caro: [C: 03+2] wmcs.toolforg.grid.get_cluster_status: allow filtering the ok ones [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766766 (https://phabricator.wikimedia.org/T302702) (owner: 10David Caro) [12:11:48] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@51d5a07] (eqiad): Fix pool size configuration (duration: 02m 01s) [12:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:35] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@51d5a07] (codfw): Fix pool size configuration [12:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:40] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.provision: always set the BiosBootSeq [cookbooks] - 10https://gerrit.wikimedia.org/r/767074 (owner: 10Volans) [12:14:37] (03Merged) 10jenkins-bot: wmcs.toolforge.grid.get_cluster_status: improve yaml output [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766765 (https://phabricator.wikimedia.org/T302702) (owner: 10David Caro) [12:14:39] (03Merged) 10jenkins-bot: wmcs.toolforg.grid.get_cluster_status: allow filtering the ok ones [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766766 (https://phabricator.wikimedia.org/T302702) (owner: 10David Caro) [12:15:16] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@51d5a07] (codfw): Fix pool size configuration (duration: 01m 41s) [12:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:34] (03PS2) 10Giuseppe Lavagetto: appserver: remove http pool from backends [puppet] - 10https://gerrit.wikimedia.org/r/766577 
(https://phabricator.wikimedia.org/T244843) [12:22:11] (03PS2) 10Volans: redfish: DellSCP, allow creation of new entities [software/spicerack] - 10https://gerrit.wikimedia.org/r/767071 [12:22:13] (03PS1) 10Volans: prospector: ignore deprecation message [software/spicerack] - 10https://gerrit.wikimedia.org/r/767108 [12:22:59] (03CR) 10Volans: [C: 03+2] "Merging to unblock other patches." [software/spicerack] - 10https://gerrit.wikimedia.org/r/767108 (owner: 10Volans) [12:23:32] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/767074 (owner: 10Volans) [12:28:38] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@41d2498] (codfw): Reduce pool size to 1 connection per node worker [12:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:09] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@41d2498] (codfw): Reduce pool size to 1 connection per node worker (duration: 01m 30s) [12:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:05] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@41d2498] (eqiad): Reduce pool size to 1 connection per node worker [12:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:11] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@41d2498] (eqiad): Reduce pool size to 1 connection per node worker (duration: 01m 06s) [12:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:35] (03Merged) 10jenkins-bot: prospector: ignore deprecation message [software/spicerack] - 10https://gerrit.wikimedia.org/r/767108 (owner: 10Volans) [12:34:21] !log restart purged on cp60[12-14] [12:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:50] (03PS9) 10Volans: Introduce 'alertmanager' and 'alerting' modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 
(https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [12:35:29] (03CR) 10Ayounsi: [C: 03+1] Remove ema from router config [homer/public] - 10https://gerrit.wikimedia.org/r/767083 (owner: 10Muehlenhoff) [12:37:22] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [12:39:30] !log installing expat security updates [12:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] appserver: remove http pool from backends [puppet] - 10https://gerrit.wikimedia.org/r/766577 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:47:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3062.esams.wmnet with OS buster [12:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:52] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3062.esams.wmnet with OS buster c... 
[12:49:58] !log pool cp3062 running HAProxy as TLS termination layer - T290005 T271421 [12:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:03] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:50:03] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [12:50:41] 10SRE, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10MSantos) [12:52:03] 10SRE, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10MSantos) > Set up the traffic layer to send traffic to the service (if needed). This is a bit unclear to me currently. I am not sure fro... [12:52:07] 10SRE, 10SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (10JMeybohm) a:03JMeybohm From the context of T302617 I derive that this is widely approved. AIUI admin access basically means providing access to the shared accounts credentials w... [12:58:20] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [12:59:25] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Kormat) [13:03:07] !log restarting FPM/Apache on parsoid hosts to pick up expat update [13:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:17] (03PS1) 10Kormat: Prepare for 0.9 release. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/767117 [13:04:50] (03PS1) 10Filippo Giunchedi: misc: search-grafana-dashboards.js [software] - 10https://gerrit.wikimedia.org/r/767118 [13:05:04] (03CR) 10Kormat: [C: 03+2] Prepare for 0.9 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/767117 (owner: 10Kormat) [13:06:33] (03Merged) 10jenkins-bot: Prepare for 0.9 release. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/767117 (owner: 10Kormat) [13:09:20] (03PS4) 10Cathal Mooney: New function and changes to wmf-netbox plugin to support EVPN config. [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) [13:09:46] (03CR) 10Cathal Mooney: [C: 03+2] Change CR policy for creating aggregate Anycast routes [homer/public] - 10https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: 10Cathal Mooney) [13:09:52] (03PS2) 10Hnowlan: changeprop: add sampling configuration, set num_workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/767080 (https://phabricator.wikimedia.org/T300914) [13:10:11] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Change CR policy for creating aggregate Anycast routes [homer/public] - 10https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: 10Cathal Mooney) [13:10:21] (03Merged) 10jenkins-bot: Change CR policy for creating aggregate Anycast routes [homer/public] - 10https://gerrit.wikimedia.org/r/765568 (https://phabricator.wikimedia.org/T302315) (owner: 10Cathal Mooney) [13:15:21] !log restart cr1-drmrs for software upgrade [13:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:34] PROBLEM - Host cr1-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:22:55] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - 
https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [13:23:39] 10SRE, 10Observability-Metrics, 10Traffic: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (10MMandere) [13:25:40] RECOVERY - Host cr1-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.36 ms [13:27:55] (ProbeHttpFailed) resolved: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [13:29:12] (03PS1) 10Vgutierrez: site: Reimage cp1087 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/767124 (https://phabricator.wikimedia.org/T290005) [13:30:27] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp1087 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/767124 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [13:31:26] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6013 is CRITICAL: 1.271e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [13:31:55] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1087.eqiad.wmnet with OS buster [13:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:08] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster [13:32:50] !log restarting nginx on registry* nodes to pick up expat update [13:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:51] 10SRE, 10Infrastructure-Foundations, 10netops, 
10Patch-For-Review: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (10cmooney) Change has now been rolled out. All seems ok, aggregate route is still being created at POPs where it was previously, and announced exter... [13:34:34] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [13:34:52] PROBLEM - Confd vcl based reload on cp1089 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:37:04] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 2.161e+07 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [13:37:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (10ayounsi) @cmooney thanks! @ssingh let me know when we're good to advertise DoH from drmrs @bblack let me know when we're good to advertise nsa.wiki... 
[13:39:02] !log klausman@cumin2002 START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2002.codfw.wmnet [13:39:04] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [13:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:32] !log klausman@cumin2002 START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2003.codfw.wmnet [13:39:33] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [13:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:42] !log klausman@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [13:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:45] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [13:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:01] feck, ctrl-c in the wrong window. [13:40:28] !log uploaded wmfmariadbpy 0.9 to apt.wm.o T302796 [13:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:31] T302796: Deploy wmfmariadbpy 0.9 - https://phabricator.wikimedia.org/T302796 [13:40:37] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:40:44] !log Deploying wmfmariadbpy 0.9 T302796 [13:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:52] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [13:43:34] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [13:43:39] !log klausman@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:42] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [13:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:58] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:43:59] !log klausman@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2003.codfw.wmnet [13:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:10] Today in "klausman sabotages himself...." 
[13:44:25] :D [13:44:43] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/767174 (https://phabricator.wikimedia.org/T135991) [13:45:37] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:46:15] (03CR) 10jerkins-bot: [V: 04-1] Enable profile::auto_restarts::service for puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/767174 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:46:50] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6013 is CRITICAL: 5.276e+07 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [13:47:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1087.eqiad.wmnet with reason: host reimage [13:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:56] PROBLEM - Number of messages locally queued by purged for processing on cp6014 is CRITICAL: cluster=cache_text instance=cp6014 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [13:48:16] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:48:17] !log klausman@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2002.codfw.wmnet [13:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:37] !log klausman@cumin2002 START - Cookbook sre.ganeti.makevm for new 
host ml-staging-etcd2002.codfw.wmnet [13:48:39] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [13:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:47] Once more, with feeling [13:49:20] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [13:49:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1087.eqiad.wmnet with reason: host reimage [13:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:05] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Kormat) [13:50:37] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:50:59] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Kormat) >>! In T276589#7741816, @Kormat wrote: > I'm working on the wmfdb + wmfmariadbpy sides of this for data-persistence. wmfdb is good to go, i'm currently doing some testing with wmfmaria... 
[13:53:24] !log restart purged on cp60[15-16] [13:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:17] (03Abandoned) 10Hashar: scap config for mediawiki/tools/releases [puppet] - 10https://gerrit.wikimedia.org/r/701543 (https://phabricator.wikimedia.org/T274255) (owner: 10Hashar) [13:55:37] (JobUnavailable) firing: (2) Reduced availability for job cache_envoy in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:56:12] (03CR) 10Kormat: Add Cumin alias to match core-test role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765562 (owner: 10Muehlenhoff) [13:56:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10JMeybohm) cc @Ottomata || @odimitrijevic for `analytics-privatedata-users` approval as of data.yaml [13:57:21] PROBLEM - Number of messages locally queued by purged for processing on cp6011 is CRITICAL: cluster=cache_text instance=cp6011 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [13:57:26] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10Ottomata) Approved [13:57:56] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [13:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:48] RECOVERY - Number of messages locally queued by purged for processing on cp6011 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [14:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220301T1400). [14:00:04] No Gerrit patches in the queue for this window AFAICS. [14:00:20] indeed, nothing to do [14:00:45] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [14:01:02] (03Abandoned) 10Hashar: Set a CANARY env variable for mediawiki canaries [puppet] - 10https://gerrit.wikimedia.org/r/724695 (https://phabricator.wikimedia.org/T291870) (owner: 10Hashar) [14:01:04] (03Abandoned) 10Hashar: Split canary jobrunner to their own role [puppet] - 10https://gerrit.wikimedia.org/r/724694 (https://phabricator.wikimedia.org/T291870) (owner: 10Hashar) [14:01:08] (03Abandoned) 10Hashar: role: system::role for all mediawiki roles [puppet] - 10https://gerrit.wikimedia.org/r/730004 (owner: 10Hashar) [14:01:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10JMeybohm) [14:02:02] (03PS2) 10Muehlenhoff: Add Cumin alias to match core-test role [puppet] - 10https://gerrit.wikimedia.org/r/765562 [14:03:30] (03CR) 10jerkins-bot: [V: 04-1] Add Cumin alias to match core-test role [puppet] - 10https://gerrit.wikimedia.org/r/765562 (owner: 10Muehlenhoff) [14:03:33] PROBLEM - traffic_server tls process restarted on cp6010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server 
https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6010&var-layer=tls [14:03:50] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:03:51] !log klausman@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2002.codfw.wmnet [14:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:57] PROBLEM - statsv Varnishkafka log producer on cp6010 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:07:45] RECOVERY - statsv Varnishkafka log producer on cp6010 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:09:35] !log restarting nginx on wdqs* nodes to pick up expat update [14:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:48] (03CR) 10Hashar: "Hi Daniel, this change is to disable the git reflog on the zuul merger, we don't need to keep an history of how the branches have been cha" [puppet] - 10https://gerrit.wikimedia.org/r/757943 (owner: 10Hashar) [14:10:15] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.932e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [14:10:18] RECOVERY - Number of messages locally queued by purged for processing on cp6014 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [14:11:32] (03CR) 10Hashar: "Those are some more git config tuning for the zuul merger. That would cause git fetches to delete obsolete branches (ex: wmf branches) and" [puppet] - 10https://gerrit.wikimedia.org/r/757944 (https://phabricator.wikimedia.org/T220606) (owner: 10Hashar) [14:13:47] (03CR) 10Hashar: [C: 03+2] [WMF] script to build our plugins [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/756108 (owner: 10Hashar) [14:14:09] (03PS10) 10Ayounsi: Port labs-in4/6 to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) [14:14:57] !log klausman@cumin2002 START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2003.codfw.wmnet [14:14:59] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [14:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:47] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:55] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [14:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:23] (03PS1) 10Filippo Giunchedi: alertmanager: open per-device librenms tasks [puppet] - 10https://gerrit.wikimedia.org/r/767179 (https://phabricator.wikimedia.org/T300836) [14:24:14] (03CR) 10Filippo Giunchedi: "I didn't find an easy way to test this for now unfortunately" [puppet] - 10https://gerrit.wikimedia.org/r/767179 (https://phabricator.wikimedia.org/T300836) (owner: 10Filippo Giunchedi) [14:24:22] (03PS1) 10Jcrespo: dbbackups: Setup x1 snapshots on cumin2002 [puppet] - 10https://gerrit.wikimedia.org/r/767181 
(https://phabricator.wikimedia.org/T276589) [14:24:55] (03PS1) 10Muehlenhoff: Remove ganeti2008 [puppet] - 10https://gerrit.wikimedia.org/r/767182 (https://phabricator.wikimedia.org/T302078) [14:26:03] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Setup x1 snapshots on cumin2002 [puppet] - 10https://gerrit.wikimedia.org/r/767181 (https://phabricator.wikimedia.org/T276589) (owner: 10Jcrespo) [14:27:01] (03Merged) 10jenkins-bot: [WMF] script to build our plugins [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/756108 (owner: 10Hashar) [14:27:04] (03CR) 10Jcrespo: "Testing x1 backups on cumin2002. If they work as expected, we can migrate all other jobs and prepare for bullseye only backup orchestratio" [puppet] - 10https://gerrit.wikimedia.org/r/767181 (https://phabricator.wikimedia.org/T276589) (owner: 10Jcrespo) [14:28:05] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-etcd2001.codfw.wmnet [14:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:37] (03PS2) 10Jcrespo: dbbackups: Setup x1 snapshots on cumin2002 [puppet] - 10https://gerrit.wikimedia.org/r/767181 (https://phabricator.wikimedia.org/T276589) [14:30:17] (03PS1) 10Vgutierrez: mtail::atstls: Provide trafficserver_tls_client_healthcheck_ttfb [puppet] - 10https://gerrit.wikimedia.org/r/767185 [14:31:40] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Setup x1 snapshots on cumin2002 [puppet] - 10https://gerrit.wikimedia.org/r/767181 (https://phabricator.wikimedia.org/T276589) (owner: 10Jcrespo) [14:32:09] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:32:10] !log klausman@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2003.codfw.wmnet [14:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:23] (03CR) 
10jerkins-bot: [V: 04-1] mtail::atstls: Provide trafficserver_tls_client_healthcheck_ttfb [puppet] - 10https://gerrit.wikimedia.org/r/767185 (owner: 10Vgutierrez) [14:35:34] !log klausman@cumin2002 START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2002.codfw.wmnet [14:35:36] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [14:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:56] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1087.eqiad.wmnet with OS buster [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:10] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster c... 
[14:36:19] jenkins-bot seems unhappy for some random reason [14:36:30] !log pool cp1087 running HAProxy as TLS termination layer - T290005 T271421 [14:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:35] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:36:35] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [14:36:42] (03CR) 10Jcrespo: "This looks as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/767181 (https://phabricator.wikimedia.org/T276589) (owner: 10Jcrespo) [14:38:05] PROBLEM - traffic_server backend process restarted on cp6011 is CRITICAL: 100 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6011&var-layer=backend [14:38:46] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:39] RECOVERY - Confd vcl based reload on cp1089 is OK: reload-vcl successfully ran 0h, 3 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [14:40:58] I am getting some message "Could not find a definition for pool 'apaches'" on puppet CI [14:41:27] not sure if some parsing issue I am not seeing or something else [14:41:36] https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/40498/console [14:41:40] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [14:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:03] PROBLEM - traffic_server tls process restarted on cp6013 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6013&var-layer=tls [14:42:16] jynus, _joe_ that looks like afb501d82e [14:42:35] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [14:42:36] maybe the docker image got affected or something? [14:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:49] PROBLEM - traffic_server tls process restarted on cp6014 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6014&var-layer=tls [14:42:49] <_joe_> jynus: let me take a look [14:43:12] jynus: more like some tests expect that service definition that _joe_ removed to be in place [14:43:14] or maybe there is some additional refactoring needed for non production? 
I [14:43:17] ah, ok [14:43:28] it was so unrelated to my patch I was confused [14:43:30] <_joe_> yeah no idea why these tests didn't fire [14:43:37] <_joe_> yes it's just a CI issue [14:43:48] yeah, specifically because they worked for you and I think a later vgutierrez patch [14:43:53] so there was some delay there [14:44:13] and I was superconfused [14:44:13] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6015 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6015 [14:48:31] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-etcd2002.codfw.wmnet [14:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:32] (03CR) 10Elukey: [C: 03+2] httpbb: Add some tests for ores [puppet] - 10https://gerrit.wikimedia.org/r/766590 (https://phabricator.wikimedia.org/T300195) (owner: 10AikoChou) [14:50:20] (03CR) 10Elukey: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34019/console" [puppet] - 10https://gerrit.wikimedia.org/r/766590 (https://phabricator.wikimedia.org/T300195) (owner: 10AikoChou) [14:51:05] !log klausman@cumin2002 START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2003.codfw.wmnet [14:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:06] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [14:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:18] (03PS3) 10Bking: elastic: prevent rundir from deletion [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) [14:52:32] !log elukey@deploy1002:~$ sudo kill `pgrep -u zpapierski` (offboarded user, puppet broken on the node) [14:52:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:08] (PS1) Giuseppe Lavagetto: profile::lvs::realserver: fix tests [puppet] - https://gerrit.wikimedia.org/r/767189 [14:53:36] <_joe_> jynus: ^^ [14:54:28] the question is if that would vote verified or not :-D [14:55:11] it does, so merge, I can rebase and recheck [14:55:18] (CR) Giuseppe Lavagetto: [C: +2] profile::lvs::realserver: fix tests [puppet] - https://gerrit.wikimedia.org/r/767189 (owner: Giuseppe Lavagetto) [14:55:28] (CR) Bking: "Check experimental" [puppet] - https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: Bking) [14:56:38] (PS3) Jcrespo: dbbackups: Setup x1 snapshots on cumin2002 [puppet] - https://gerrit.wikimedia.org/r/767181 (https://phabricator.wikimedia.org/T276589) [14:57:00] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:02] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (JMeybohm) [14:58:16] nice, _joe_, now it works, thank you! 
[14:59:13] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (JMeybohm) [14:59:52] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez) [14:59:57] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Amin Al Hazwani - https://phabricator.wikimedia.org/T302775 (JMeybohm) cc @Ottomata || @odimitrijevic for `analytics-privatedata-users` approval as of data.yaml [15:00:39] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Vgutierrez) [15:01:18] SRE, Traffic, Upstream: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (Vgutierrez) Open→Resolved We've migrated the cp servers using envoy to HAProxy so this shouldn't be an issue anymore. 
[15:01:24] (PS4) Vgutierrez: mtail::atstls: Use native histogram type [puppet] - https://gerrit.wikimedia.org/r/767069 [15:01:26] (PS2) Vgutierrez: mtail::atstls: Provide trafficserver_tls_client_healthcheck_ttfb [puppet] - https://gerrit.wikimedia.org/r/767185 [15:03:50] (CR) Jcrespo: [C: +2] dbbackups: Setup x1 snapshots on cumin2002 [puppet] - https://gerrit.wikimedia.org/r/767181 (https://phabricator.wikimedia.org/T276589) (owner: Jcrespo) [15:05:41] (CR) JHathaway: [C: +2] Restrict filesystem_avail_bigger_than_size check to Stretch [puppet] - https://gerrit.wikimedia.org/r/766871 (https://phabricator.wikimedia.org/T302687) (owner: JHathaway) [15:06:42] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-staging-etcd2003.codfw.wmnet [15:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:05] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: Bking) [15:11:58] (CR) JHathaway: [C: +1] standard_packages: install freeipmi-ipmiseld on metal by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) (owner: Herron) [15:12:40] (PS1) Klausman: Add DHCP and partman info for ML staging etcd VMs [puppet] - https://gerrit.wikimedia.org/r/767194 [15:14:47] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [15:15:40] (PS2) Klausman: Add DHCP and partman info for ML staging etcd VMs [puppet] - https://gerrit.wikimedia.org/r/767194 (https://phabricator.wikimedia.org/T302503) [15:17:27] (PS2) Giuseppe Lavagetto: conftool: remove http pools for mediawiki [puppet] - https://gerrit.wikimedia.org/r/766578 (https://phabricator.wikimedia.org/T244843) [15:21:19] !log ntsako@deploy1002 Started deploy [airflow-dags/analytics@cac16e8]: (no justification provided) [15:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:27] !log ntsako@deploy1002 Finished deploy [airflow-dags/analytics@cac16e8]: (no justification provided) (duration: 00m 07s) [15:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:31] (CR) Giuseppe Lavagetto: [C: +2] conftool: remove http pools for mediawiki [puppet] - https://gerrit.wikimedia.org/r/766578 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto) [15:22:49] RECOVERY - traffic_server tls process restarted on cp6010 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6010&var-layer=tls [15:23:16] <_joe_> jhathaway: can I merge your patch too? 
[15:23:24] yes, thanks [15:24:27] <_joe_> done [15:24:31] RECOVERY - traffic_server tls process restarted on cp6011 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6011&var-layer=tls [15:27:03] RECOVERY - traffic_server tls process restarted on cp6013 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6013&var-layer=tls [15:28:23] RECOVERY - traffic_server tls process restarted on cp6014 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6014&var-layer=tls [15:31:45] PROBLEM - mediawiki-installation DSH group on mw1376 is CRITICAL: Host mw1376 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:32:07] PROBLEM - mediawiki-installation DSH group on mw1341 is CRITICAL: Host mw1341 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:32:15] PROBLEM - mediawiki-installation DSH group on mw1313 is CRITICAL: Host mw1313 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:32:21] PROBLEM - mediawiki-installation DSH group on mw1431 is CRITICAL: Host mw1431 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:32:35] PROBLEM - mediawiki-installation DSH group on mw2388 is CRITICAL: Host mw2388 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:33:11] PROBLEM - mediawiki-installation DSH group on mw1401 is CRITICAL: Host mw1401 is not in 
mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:33:19] PROBLEM - mediawiki-installation DSH group on mw1353 is CRITICAL: Host mw1353 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:34:24] That doesn't sound good [15:35:09] PROBLEM - mediawiki-installation DSH group on mw1387 is CRITICAL: Host mw1387 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:35:11] PROBLEM - mediawiki-installation DSH group on mw1381 is CRITICAL: Host mw1381 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:35:14] !log elukey@deploy1002 Started deploy [ores/deploy@29de1cc]: ORES Winter deployment - T300195 [15:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:18] T300195: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 [15:35:21] PROBLEM - mediawiki-installation DSH group on mw2273 is CRITICAL: Host mw2273 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:35:41] PROBLEM - mediawiki-installation DSH group on mw1368 is CRITICAL: Host mw1368 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:35:49] PROBLEM - mediawiki-installation DSH group on mw2257 is CRITICAL: Host mw2257 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:35:49] PROBLEM - mediawiki-installation DSH group on mw1411 is CRITICAL: Host mw1411 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:36:08] _joe_: ^ expected? [15:36:13] (CR) Arturo Borrero Gonzalez: [C: +1] "We plan to merge this tomorrow morning." 
[homer/public] - https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) (owner: Ayounsi) [15:36:17] PROBLEM - mediawiki-installation DSH group on mw1326 is CRITICAL: Host mw1326 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:36:22] <_joe_> eeer no [15:36:24] <_joe_> sigh [15:36:28] <_joe_> yes that's on me [15:36:35] <_joe_> it's easy to fix though [15:36:39] PROBLEM - mediawiki-installation DSH group on mw1331 is CRITICAL: Host mw1331 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:36:39] PROBLEM - mediawiki-installation DSH group on mw2293 is CRITICAL: Host mw2293 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:36:39] PROBLEM - mediawiki-installation DSH group on mw2270 is CRITICAL: Host mw2270 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:36:39] PROBLEM - mediawiki-installation DSH group on mw2307 is CRITICAL: Host mw2307 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:36:43] PROBLEM - mediawiki-installation DSH group on mw2401 is CRITICAL: Host mw2401 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:36:59] (CR) Arturo Borrero Gonzalez: [C: +1] Port labs-in4/6 to Capirca (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/701347 (https://phabricator.wikimedia.org/T285461) (owner: Ayounsi) [15:37:35] PROBLEM - mediawiki-installation DSH group on mw2286 is CRITICAL: Host mw2286 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:37:47] <_joe_> thanks RhinosF1 [15:38:02] _joe_: no problem [15:38:13] PROBLEM - mediawiki-installation DSH group on mw1379 is CRITICAL: Host mw1379 is not 
in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:39:03] PROBLEM - mediawiki-installation DSH group on mw1370 is CRITICAL: Host mw1370 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:39:09] PROBLEM - mediawiki-installation DSH group on mw1332 is CRITICAL: Host mw1332 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:39:15] PROBLEM - mediawiki-installation DSH group on mw2292 is CRITICAL: Host mw2292 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:39:39] PROBLEM - mediawiki-installation DSH group on mw1433 is CRITICAL: Host mw1433 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:39:47] PROBLEM - mediawiki-installation DSH group on mw1333 is CRITICAL: Host mw1333 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:39:47] PROBLEM - mediawiki-installation DSH group on mw1359 is CRITICAL: Host mw1359 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:39:47] PROBLEM - mediawiki-installation DSH group on mw1374 is CRITICAL: Host mw1374 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:39:55] PROBLEM - mediawiki-installation DSH group on mw1444 is CRITICAL: Host mw1444 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:39:55] PROBLEM - mediawiki-installation DSH group on mw1400 is CRITICAL: Host mw1400 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:40:14] (PS1) Giuseppe Lavagetto: scap: fix dsh groups for mediawiki [puppet] - https://gerrit.wikimedia.org/r/767203 
[15:40:23] PROBLEM - mediawiki-installation DSH group on mw1434 is CRITICAL: Host mw1434 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:40:23] PROBLEM - mediawiki-installation DSH group on mw2354 is CRITICAL: Host mw2354 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:40:39] (CR) Giuseppe Lavagetto: [V: +2 C: +2] scap: fix dsh groups for mediawiki [puppet] - https://gerrit.wikimedia.org/r/767203 (owner: Giuseppe Lavagetto) [15:40:41] PROBLEM - mediawiki-installation DSH group on mw2335 is CRITICAL: Host mw2335 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:40:41] PROBLEM - mediawiki-installation DSH group on mw2324 is CRITICAL: Host mw2324 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:40:55] PROBLEM - mediawiki-installation DSH group on mw2336 is CRITICAL: Host mw2336 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:40:55] PROBLEM - mediawiki-installation DSH group on mw2392 is CRITICAL: Host mw2392 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:40:59] PROBLEM - mediawiki-installation DSH group on mw1355 is CRITICAL: Host mw1355 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:40:59] PROBLEM - mediawiki-installation DSH group on mw2299 is CRITICAL: Host mw2299 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:41:21] (CR) Filippo Giunchedi: "On closer inspection, I think we want this in ipmi::monitor" [puppet] - https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) (owner: Herron) [15:41:51] PROBLEM - mediawiki-installation 
DSH group on mw1346 is CRITICAL: Host mw1346 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:42:07] <_joe_> puppet is painfully slow today [15:42:23] PROBLEM - mediawiki-installation DSH group on mw2331 is CRITICAL: Host mw2331 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:42:53] PROBLEM - mediawiki-installation DSH group on mw1452 is CRITICAL: Host mw1452 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:43:01] PROBLEM - mediawiki-installation DSH group on mw2350 is CRITICAL: Host mw2350 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:43:05] PROBLEM - mediawiki-installation DSH group on mw1407 is CRITICAL: Host mw1407 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:43:05] PROBLEM - mediawiki-installation DSH group on mw2376 is CRITICAL: Host mw2376 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:43:06] PROBLEM - mediawiki-installation DSH group on mw2368 is CRITICAL: Host mw2368 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:43:09] PROBLEM - mediawiki-installation DSH group on mw1399 is CRITICAL: Host mw1399 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:43:09] PROBLEM - mediawiki-installation DSH group on mw1398 is CRITICAL: Host mw1398 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:43:39] PROBLEM - mediawiki-installation DSH group on mw2269 is CRITICAL: Host mw2269 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:44:19] PROBLEM - 
mediawiki-installation DSH group on mw1339 is CRITICAL: Host mw1339 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:44:31] PROBLEM - mediawiki-installation DSH group on mw2294 is CRITICAL: Host mw2294 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:44:31] PROBLEM - mediawiki-installation DSH group on mw2301 is CRITICAL: Host mw2301 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:44:31] PROBLEM - mediawiki-installation DSH group on mw2310 is CRITICAL: Host mw2310 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:44:35] PROBLEM - mediawiki-installation DSH group on mw1425 is CRITICAL: Host mw1425 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:44:55] PROBLEM - mediawiki-installation DSH group on mw1388 is CRITICAL: Host mw1388 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:44:59] (PS1) Giuseppe Lavagetto: scap: testservers doesn't use nginx [puppet] - https://gerrit.wikimedia.org/r/767208 [15:45:09] PROBLEM - mediawiki-installation DSH group on mw1329 is CRITICAL: Host mw1329 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:45:13] (CR) Giuseppe Lavagetto: [V: +2 C: +2] scap: testservers doesn't use nginx [puppet] - https://gerrit.wikimedia.org/r/767208 (owner: Giuseppe Lavagetto) [15:45:15] PROBLEM - mediawiki-installation DSH group on mw2371 is CRITICAL: Host mw2371 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:45:19] PROBLEM - mediawiki-installation DSH group on mw1382 is CRITICAL: Host mw1382 is not in mediawiki-installation dsh group 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:45:29] PROBLEM - mediawiki-installation DSH group on mw2405 is CRITICAL: Host mw2405 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:45:33] PROBLEM - mediawiki-installation DSH group on mw1342 is CRITICAL: Host mw1342 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:45:37] PROBLEM - mediawiki-installation DSH group on mw2352 is CRITICAL: Host mw2352 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:45:39] PROBLEM - mediawiki-installation DSH group on mw1454 is CRITICAL: Host mw1454 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:45:45] PROBLEM - mediawiki-installation DSH group on mw2367 is CRITICAL: Host mw2367 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:45:53] PROBLEM - mediawiki-installation DSH group on mw1406 is CRITICAL: Host mw1406 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:45:59] PROBLEM - mediawiki-installation DSH group on mw2312 is CRITICAL: Host mw2312 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:45:59] PROBLEM - mediawiki-installation DSH group on mw2387 is CRITICAL: Host mw2387 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:46:05] PROBLEM - mediawiki-installation DSH group on mw1432 is CRITICAL: Host mw1432 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:46:15] PROBLEM - mediawiki-installation DSH group on mw1435 is CRITICAL: Host mw1435 is not in mediawiki-installation dsh group 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:46:31] PROBLEM - mediawiki-installation DSH group on mw1343 is CRITICAL: Host mw1343 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:46:31] PROBLEM - mediawiki-installation DSH group on mw2325 is CRITICAL: Host mw2325 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:46:49] PROBLEM - mediawiki-installation DSH group on mw2295 is CRITICAL: Host mw2295 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:47:07] PROBLEM - mediawiki-installation DSH group on mw1397 is CRITICAL: Host mw1397 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:47:11] <_joe_> ok fixed [15:47:29] PROBLEM - mediawiki-installation DSH group on mw2277 is CRITICAL: Host mw2277 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:47:29] PROBLEM - mediawiki-installation DSH group on mw2296 is CRITICAL: Host mw2296 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:48:29] PROBLEM - mediawiki-installation DSH group on mw2372 is CRITICAL: Host mw2372 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:48:59] PROBLEM - mediawiki-installation DSH group on mw2358 is CRITICAL: Host mw2358 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:49:43] PROBLEM - mediawiki-installation DSH group on mw1405 is CRITICAL: Host mw1405 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:49:45] (CR) Muehlenhoff: "check" [puppet] - https://gerrit.wikimedia.org/r/767174 (https://phabricator.wikimedia.org/T135991) (owner: 
Muehlenhoff) [15:49:49] PROBLEM - mediawiki-installation DSH group on mw1320 is CRITICAL: Host mw1320 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:49:53] PROBLEM - mediawiki-installation DSH group on mw2360 is CRITICAL: Host mw2360 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:50:21] PROBLEM - mediawiki-installation DSH group on mw1389 is CRITICAL: Host mw1389 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:50:27] PROBLEM - mediawiki-installation DSH group on mw1351 is CRITICAL: Host mw1351 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:51:31] <_joe_> ok now the recoveries shall arrive [15:51:38] <_joe_> from icinga [15:55:21] (PS2) Herron: ipmi::monitor: install freeipmi-ipmiseld on metal by default [puppet] - https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) [15:56:46] (CR) Muehlenhoff: ipmi::monitor: install freeipmi-ipmiseld on metal by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) (owner: Herron) [15:56:49] (CR) Filippo Giunchedi: [C: +1] ipmi::monitor: install freeipmi-ipmiseld on metal by default [puppet] - https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) (owner: Herron) [15:57:35] Puppet, Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (jhathaway) > On a side note, I see there is a proposal of using /vendor/modules. It seems interesting and I've never tried it, I am wondering what technical hurdles we'd meet. Any ideas? Us... 
[15:59:02] (PS1) Jcrespo: dbbackups: Migrate codfw DB snapshot orchestration from cumin2001 to 2002 [puppet] - https://gerrit.wikimedia.org/r/767212 (https://phabricator.wikimedia.org/T276589) [16:00:01] (CR) Filippo Giunchedi: [C: +1] ipmi::monitor: install freeipmi-ipmiseld on metal by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) (owner: Herron) [16:00:43] (PS4) Bking: elastic: prevent rundir from deletion [puppet] - https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) [16:01:17] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: Bking) [16:01:23] (CR) Herron: ipmi::monitor: install freeipmi-ipmiseld on metal by default (2 comments) [puppet] - https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) (owner: Herron) [16:01:52] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:04:42] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) (owner: Herron) [16:04:46] PROBLEM - SSH on cp6016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:05:22] !log restarting nginx on wcqs* nodes to pick up expat update [16:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:14] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Amin Al Hazwani - https://phabricator.wikimedia.org/T302775 (Ottomata) Hiya, Could you give us a little more context as to why you need this access? What team you are working with, etc.? If you could, please follow the templa... 
[16:08:00] RECOVERY - SSH on cp6016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:08:46] PROBLEM - Number of messages locally queued by purged for processing on cp6013 is CRITICAL: cluster=cache_text instance=cp6013 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [16:09:37] (CR) Herron: [C: +2] ipmi::monitor: install freeipmi-ipmiseld on metal by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) (owner: Herron) [16:11:27] !log elukey@deploy1002 Finished deploy [ores/deploy@29de1cc]: ORES Winter deployment - T300195 (duration: 36m 13s) [16:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:34] \o/ [16:12:30] T300195: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 [16:12:45] !log restarting apache on logstash nodes to pick up expat update [16:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:10] (CR) Herron: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/767179 (https://phabricator.wikimedia.org/T300836) (owner: Filippo Giunchedi) [16:18:26] (PS1) Genoveva Galarza: [WIP] Create charts for wikifunctions services [deployment-charts] - https://gerrit.wikimedia.org/r/767215 (https://phabricator.wikimedia.org/T295698) [16:18:40] RECOVERY - mediawiki-installation DSH group on mw1313 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:19:45] <_joe_> RhinosF1: I rescheduled the check there ^^ [16:19:52] RECOVERY - Number of messages locally queued by purged for processing on cp6013 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [16:19:52] <_joe_> so it will recover eventually [16:19:58] Ah! [16:20:07] Thanks for the quick responses [16:20:19] (CR) Jcrespo: [C: +2] dbbackups: Migrate codfw DB snapshot orchestration from cumin2001 to 2002 [puppet] - https://gerrit.wikimedia.org/r/767212 (https://phabricator.wikimedia.org/T276589) (owner: Jcrespo) [16:20:33] (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/767174 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [16:22:40] Puppet, Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (jhathaway) Based on the discussion so far my inclination is that we stick with our current method of vendoring Community modules in `./modules`. Though not a perfect solution, it seems to have... [16:23:47] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@cac16e8]: (no justification provided) [16:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:50] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@cac16e8]: (no justification provided) (duration: 00m 03s) [16:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:18] SRE, Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (jcrespo) The x1 backup worked as expected, with normal performance. I just uploaded the 0.6 packages and migrated backups to cumin2002. I will do another large backup test, but other than tagg... 
[16:26:16] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 67 probes of 661 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:28:50] (PS1) Tchanders: Enable IPInfo on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) [16:29:32] (CR) Tchanders: [C: -2] "Stalled until we get the go ahead from Legal etc" [mediawiki-config] - https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) (owner: Tchanders) [16:30:09] (PS5) Cwhite: logstash: add blackbox-exporter filter config [puppet] - https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [16:30:32] (CR) Elukey: [C: +1] "I haven't checked the mac addresses but the rest looks good!" [puppet] - https://gerrit.wikimedia.org/r/767194 (https://phabricator.wikimedia.org/T302503) (owner: Klausman) [16:31:40] PROBLEM - traffic_server tls process restarted on cp6016 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6016&var-layer=tls [16:31:44] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 57 probes of 661 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:32:09] (CR) Klausman: Add DHCP and partman info for ML staging etcd VMs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767194 (https://phabricator.wikimedia.org/T302503) (owner: Klausman) [16:32:10] RECOVERY - mediawiki-installation DSH group on mw1376 is OK: OK 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:32:15] (03CR) 10Klausman: [C: 03+2] Add DHCP and partman info for ML staging etcd VMs [puppet] - 10https://gerrit.wikimedia.org/r/767194 (https://phabricator.wikimedia.org/T302503) (owner: 10Klausman) [16:32:32] RECOVERY - mediawiki-installation DSH group on mw1341 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:32:50] RECOVERY - mediawiki-installation DSH group on mw1431 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:33:00] (03PS6) 10Cwhite: logstash: add blackbox-exporter filter config [puppet] - 10https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:33:10] RECOVERY - mediawiki-installation DSH group on mw2388 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:33:50] RECOVERY - mediawiki-installation DSH group on mw1401 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:34:02] RECOVERY - mediawiki-installation DSH group on mw1353 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:35:26] PROBLEM - Number of messages locally queued by purged for processing on cp6015 is CRITICAL: cluster=cache_text instance=cp6015 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6015 [16:36:00] RECOVERY - mediawiki-installation DSH group on mw1387 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:36:02] RECOVERY - mediawiki-installation DSH group on mw1381 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:36:10] RECOVERY - mediawiki-installation DSH group on mw2273 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:36:20] PROBLEM - Number of messages locally queued by purged for processing on 
cp6011 is CRITICAL: cluster=cache_text instance=cp6011 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [16:36:30] RECOVERY - mediawiki-installation DSH group on mw1368 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:36:38] RECOVERY - mediawiki-installation DSH group on mw1411 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:36:38] RECOVERY - mediawiki-installation DSH group on mw2257 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:37:08] RECOVERY - mediawiki-installation DSH group on mw1326 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:37:32] RECOVERY - mediawiki-installation DSH group on mw1331 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:37:32] RECOVERY - mediawiki-installation DSH group on mw2270 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:37:32] RECOVERY - mediawiki-installation DSH group on mw2293 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:37:32] RECOVERY - mediawiki-installation DSH group on mw2307 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:37:36] RECOVERY - mediawiki-installation DSH group on mw2401 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:38:24] RECOVERY - mediawiki-installation DSH group on mw2286 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:39:00] RECOVERY - mediawiki-installation DSH group on mw1379 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:40:02] RECOVERY - mediawiki-installation DSH group on mw1370 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:40:08] RECOVERY - mediawiki-installation DSH group on mw1332 is OK: OK 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:40:18] RECOVERY - mediawiki-installation DSH group on mw2292 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:40:28] (03CR) 10Cwhite: [C: 03+1] "Amended to use the "dot-delimited string form" in the tests as this is how rsyslog presents ecs_170-templated events to Logstash." [puppet] - 10https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:40:42] RECOVERY - mediawiki-installation DSH group on mw1433 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:40:52] RECOVERY - mediawiki-installation DSH group on mw1333 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:40:52] RECOVERY - mediawiki-installation DSH group on mw1359 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:40:52] RECOVERY - mediawiki-installation DSH group on mw1374 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:41:02] RECOVERY - mediawiki-installation DSH group on mw1400 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:41:02] RECOVERY - mediawiki-installation DSH group on mw1444 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:41:30] RECOVERY - mediawiki-installation DSH group on mw1434 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:41:30] RECOVERY - mediawiki-installation DSH group on mw2354 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:41:50] RECOVERY - mediawiki-installation DSH group on mw2324 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:41:50] RECOVERY - mediawiki-installation DSH group on mw2335 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:42:06] RECOVERY - mediawiki-installation DSH group on mw2336 is OK: OK 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:42:06] RECOVERY - mediawiki-installation DSH group on mw2392 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:42:10] RECOVERY - mediawiki-installation DSH group on mw1355 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:42:10] RECOVERY - mediawiki-installation DSH group on mw2299 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:42:26] RECOVERY - Number of messages locally queued by purged for processing on cp6011 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [16:43:00] RECOVERY - mediawiki-installation DSH group on mw1346 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:43:20] RECOVERY - Number of messages locally queued by purged for processing on cp6015 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6015 [16:43:28] RECOVERY - mediawiki-installation DSH group on mw2331 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:43:56] RECOVERY - mediawiki-installation DSH group on mw1452 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:44:08] RECOVERY - mediawiki-installation DSH group on mw2350 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:44:12] RECOVERY - mediawiki-installation DSH group on mw1407 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:44:12] RECOVERY - mediawiki-installation DSH group on mw2368 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:44:12] RECOVERY - mediawiki-installation DSH group on mw2376 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:44:14] RECOVERY - mediawiki-installation DSH group on mw1398 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:44:14] RECOVERY - mediawiki-installation DSH group on mw1399 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:44:48] RECOVERY - mediawiki-installation DSH group on mw2269 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:45:30] RECOVERY - mediawiki-installation DSH group on mw1339 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:45:42] RECOVERY - mediawiki-installation DSH group on mw2294 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:45:42] RECOVERY - mediawiki-installation DSH group on mw2310 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:45:42] RECOVERY - mediawiki-installation DSH group on mw2301 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:45:48] RECOVERY - 
mediawiki-installation DSH group on mw1425 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:46:10] RECOVERY - mediawiki-installation DSH group on mw1388 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:46:26] RECOVERY - mediawiki-installation DSH group on mw1329 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:46:36] RECOVERY - mediawiki-installation DSH group on mw2371 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:46:42] RECOVERY - mediawiki-installation DSH group on mw1382 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:46:52] RECOVERY - mediawiki-installation DSH group on mw2405 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:46:58] RECOVERY - mediawiki-installation DSH group on mw1342 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:47:04] RECOVERY - mediawiki-installation DSH group on mw2352 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:47:04] RECOVERY - mediawiki-installation DSH group on mw1454 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:47:12] RECOVERY - mediawiki-installation DSH group on mw2367 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:47:18] RECOVERY - mediawiki-installation DSH group on mw1406 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:47:24] RECOVERY - mediawiki-installation DSH group on mw2312 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:47:24] RECOVERY - mediawiki-installation DSH group on mw2387 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:47:32] RECOVERY - mediawiki-installation DSH group on mw1432 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:47:42] RECOVERY - mediawiki-installation DSH group on mw1435 is OK: OK 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:47:56] RECOVERY - mediawiki-installation DSH group on mw1343 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:47:56] RECOVERY - mediawiki-installation DSH group on mw2325 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:48:16] RECOVERY - mediawiki-installation DSH group on mw2295 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:48:34] RECOVERY - mediawiki-installation DSH group on mw1397 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:49:00] RECOVERY - mediawiki-installation DSH group on mw2277 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:49:00] RECOVERY - mediawiki-installation DSH group on mw2296 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:49:42] jouncebot now [16:49:42] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [16:49:59] I'm running a test on the deploy server [16:50:00] RECOVERY - mediawiki-installation DSH group on mw2372 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:50:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10JMeybohm) > - Email address: damien+wikimedia@desfontain.es This does not match the mail address we've recorded in LDAP, I'll be using the LDAP one instead. > I will als... [16:50:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10JMeybohm) > I will also need access to the wmf LDAP group That can't be granted to contractors without wikimedia.org mail address[1], I'll add you to `nda` instead. [1] h... 
[16:50:28] RECOVERY - mediawiki-installation DSH group on mw2358 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:51:12] RECOVERY - mediawiki-installation DSH group on mw1405 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:51:20] RECOVERY - mediawiki-installation DSH group on mw1320 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:51:26] RECOVERY - mediawiki-installation DSH group on mw2360 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:51:54] RECOVERY - mediawiki-installation DSH group on mw1389 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:52:00] RECOVERY - mediawiki-installation DSH group on mw1351 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:52:48] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767174 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:53:21] (03PS3) 10Hashar: gerrit: move CI result table to a tab [puppet] - 10https://gerrit.wikimedia.org/r/756685 [16:54:23] jouncebot next [16:54:23] In 0 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220301T1700) [16:54:23] In 0 hour(s) and 5 minute(s): Grafana 8 Upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220301T1700) [16:54:54] PROBLEM - Number of messages locally queued by purged for processing on cp6011 is CRITICAL: cluster=cache_text instance=cp6011 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [16:55:24] !log dancy@deploy1002 Started scap: testing container image build [16:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:26] PROBLEM - Check systemd state on mw2377 is CRITICAL: CRITICAL - 
degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:36] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:38] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Amin Al Hazwani - https://phabricator.wikimedia.org/T302775 (10aminalhazwani) Absolutely, thanks for pointing me to the template @Ottomata! - Wikitech username: Amin Al Hazwani - Email address: aalhazwani@wikimedia.org... [16:57:46] PROBLEM - Check systemd state on druid1004 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:50] PROBLEM - Check systemd state on an-worker1088 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:50] PROBLEM - Check systemd state on cp1081 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:54] PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:06] PROBLEM - Check systemd state on lvs6003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:06] PROBLEM - Check systemd state on parse2020 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:12] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:12] PROBLEM - Check systemd state on mw1445 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:14] PROBLEM - Check systemd state on db2146 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:27] ^? [16:58:28] PROBLEM - Check systemd state on mw1330 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:42] PROBLEM - Check systemd state on ganeti2016 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:44] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:46] PROBLEM - Check systemd state on mw1420 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:48] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:48] PROBLEM - Check systemd state on mw1333 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:50] PROBLEM - Check systemd state on mw2360 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:50] PROBLEM - Check systemd state on mw1440 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:52] PROBLEM - Check systemd state on cp5004 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:52] PROBLEM - Check systemd state on mw2407 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:54] PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:54] PROBLEM - Check systemd state on mw2306 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:54] PROBLEM - Check systemd state on mw2282 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:55] ipmiseld.service [16:58:56] PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:59] is what's failing everywhere [16:59:02] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:04] PROBLEM - Check systemd state on kafka-main2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:11] see -sre [16:59:12] PROBLEM - Check systemd state on db1116 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:14] PROBLEM - Check systemd state on 
mw1391 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:18] PROBLEM - Check systemd state on mw1449 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:20] PROBLEM - Check systemd state on restbase2023 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:20] PROBLEM - Check systemd state on mw1315 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:32] PROBLEM - Check systemd state on ganeti2022 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:36] PROBLEM - Check systemd state on mw2352 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:36] PROBLEM - Check systemd state on mw2311 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:36] PROBLEM - Check systemd state on mw2382 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:40] PROBLEM - Check systemd state on thanos-be1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:44] PROBLEM - Check systemd state on mc1050 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:46] PROBLEM - Check systemd state on restbase2012 
is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:48] PROBLEM - Check systemd state on ganeti1017 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:52] PROBLEM - Check systemd state on logstash1033 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:52] PROBLEM - Check systemd state on mw2307 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:54] PROBLEM - Check systemd state on mw2351 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:54] PROBLEM - Check systemd state on mw2273 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:00] PROBLEM - Check systemd state on mw2292 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:04] PROBLEM - Check systemd state on mw2297 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:04] PROBLEM - Check systemd state on ganeti2028 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:04] PROBLEM - Check systemd state on kafka-jumbo1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:04] PROBLEM - Check systemd state on mw2389 is 
CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:05] jbond and rzl: #bothumor I ♥ Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220301T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:05] cwhite: It is that lovely time of the day again! You are hereby commanded to deploy Grafana 8 Upgrade. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220301T1700). [17:00:06] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:06] PROBLEM - Check systemd state on parse2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:08] PROBLEM - Check systemd state on kafka-jumbo1008 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:08] PROBLEM - Check systemd state on backup2007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:10] PROBLEM - Check systemd state on lvs1018 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:10] PROBLEM - Check systemd state on mw1320 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:14] PROBLEM - Check systemd state on mw1361 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:14] PROBLEM - 
Check systemd state on cp4029 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:16] PROBLEM - Check systemd state on mw2403 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:18] PROBLEM - Check systemd state on mw1366 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:20] PROBLEM - Check systemd state on cp2041 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:20] PROBLEM - Check systemd state on mw2339 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:22] PROBLEM - Check systemd state on mw2335 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:24] PROBLEM - Check systemd state on cp4033 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:24] PROBLEM - Check systemd state on db1176 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:24] PROBLEM - Check systemd state on db1161 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:28] PROBLEM - Check systemd state on mw2411 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:30] PROBLEM - Check systemd state on 
an-worker1097 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:32] RECOVERY - Number of messages locally queued by purged for processing on cp6011 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [17:00:38] PROBLEM - Check systemd state on wtp1036 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:38] PROBLEM - Check systemd state on dbstore1007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:40] PROBLEM - Check systemd state on ganeti1016 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:42] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:44] PROBLEM - Check systemd state on an-worker1091 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:46] PROBLEM - Check systemd state on an-worker1090 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:52] PROBLEM - Check systemd state on ganeti1026 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:55] rzl: I have a puppet patch [17:00:56] PROBLEM - Check systemd state on mw2254 is CRITICAL: CRITICAL - degraded: The following 
units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:56] PROBLEM - Check systemd state on kafka-main1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:02] PROBLEM - Check systemd state on an-druid1004 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:04] PROBLEM - Check systemd state on an-worker1092 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:04] PROBLEM - Check systemd state on mc2026 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:04] PROBLEM - Check systemd state on cp4035 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:06] PROBLEM - Check systemd state on mw1407 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:07] that's a lot of unhappyness [17:01:10] PROBLEM - Check systemd state on wtp1033 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:16] PROBLEM - Check systemd state on mw1434 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:16] See -sre Emperor [17:01:18] PROBLEM - Check systemd state on mc-gp2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:18] dancy: sorry, I have a 
meeting conflict -- can I check in with you in 1h? [17:01:24] PROBLEM - Check systemd state on bast1003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:25] ta [17:01:28] rzl: yes please [17:01:30] PROBLEM - Check systemd state on lvs2007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:32] PROBLEM - Check systemd state on lvs3005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:34] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Amin Al Hazwani - https://phabricator.wikimedia.org/T302775 (mwilliams) Approved from my end! [17:01:36] PROBLEM - Check systemd state on cp2033 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:36] PROBLEM - Check systemd state on mw1382 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:38] PROBLEM - Check systemd state on cp4024 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:38] PROBLEM - Check systemd state on cp3059 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:40] PROBLEM - Check systemd state on dbproxy1018 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:42] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:42] PROBLEM - Check systemd state on mw2270 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:50] PROBLEM - Check systemd state on an-worker1111 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:50] PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:50] PROBLEM - Check systemd state on cp5001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:50] PROBLEM - Check systemd state on db1163 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:58] PROBLEM - Check systemd state on mw1408 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:00] PROBLEM - Check systemd state on mw2408 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:02] PROBLEM - Check systemd state on ganeti1027 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:06] PROBLEM - Check systemd state on dbprov2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:06] PROBLEM - Check systemd state on mw1355 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:08] PROBLEM - Check systemd state on kafka-main2005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:10] PROBLEM - Check systemd state on ganeti1008 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:12] PROBLEM - Check systemd state on dns4001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:12] PROBLEM - Check systemd state on mc2032 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:16] PROBLEM - Check systemd state on mw1338 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:18] PROBLEM - Check systemd state on wtp1048 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:26] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:26] PROBLEM - Check systemd state on mw1393 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:28] PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: 
ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:28] PROBLEM - Check systemd state on mw2272 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:28] PROBLEM - Check systemd state on mw2268 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:30] PROBLEM - Check systemd state on clouddb1019 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:30] PROBLEM - Check systemd state on logstash1035 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:32] PROBLEM - Check systemd state on mw2323 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:32] PROBLEM - Check systemd state on db1171 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:34] PROBLEM - Check systemd state on mw2409 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:36] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:36] PROBLEM - Check systemd state on puppetmaster1003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:44] PROBLEM - Check systemd state on kafka-jumbo1006 is CRITICAL: CRITICAL - degraded: The following units 
failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:50] PROBLEM - Check systemd state on mw1395 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:52] PROBLEM - Check systemd state on parse2012 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:54] PROBLEM - Check systemd state on ganeti1021 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:54] PROBLEM - Check systemd state on mw1421 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:56] PROBLEM - Check systemd state on parse2002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:56] PROBLEM - Check systemd state on ganeti1019 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:56] PROBLEM - Check systemd state on mw2338 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:58] PROBLEM - Check systemd state on logstash2029 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:00] PROBLEM - Check systemd state on ganeti4003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:06] PROBLEM - Check systemd state on mw1335 is CRITICAL: CRITICAL - degraded: The following units 
failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:06] PROBLEM - Check systemd state on mw1342 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:10] PROBLEM - Check systemd state on mw1328 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:10] PROBLEM - Check systemd state on mw1343 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:12] PROBLEM - Check systemd state on mw2264 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:12] PROBLEM - Check systemd state on lvs5001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:12] PROBLEM - Check systemd state on mw2301 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:14] PROBLEM - Check systemd state on mw1430 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:16] PROBLEM - Check systemd state on cp3057 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:16] PROBLEM - Check systemd state on db1140 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:17] PROBLEM - Check systemd state on db1144 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:20] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:22] PROBLEM - Check systemd state on mw2371 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:22] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:24] PROBLEM - Check systemd state on mw2401 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:24] PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:28] PROBLEM - Check systemd state on wtp1043 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:32] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:34] PROBLEM - Check systemd state on thanos-be2004 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:38] PROBLEM - Check systemd state on cp5009 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:40] PROBLEM - Check systemd state on an-worker1100 is CRITICAL: CRITICAL - degraded: The following units failed: 
ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:42] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:54] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:56] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (Damiendf) [17:03:56] PROBLEM - Check systemd state on an-worker1089 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:56] PROBLEM - Check systemd state on db1112 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:56] PROBLEM - Check systemd state on mw1329 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:04] PROBLEM - Check systemd state on mw1451 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:06] PROBLEM - Check systemd state on mw2261 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:08] PROBLEM - Check systemd state on mw2328 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:09] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Amin Al Hazwani - https://phabricator.wikimedia.org/T302775 (Ottomata) 
Approved, thank you! [17:04:12] PROBLEM - Check systemd state on backup1007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:14] PROBLEM - Check systemd state on logstash2035 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:14] PROBLEM - Check systemd state on logstash2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:16] PROBLEM - Check systemd state on kafka-jumbo1007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:16] PROBLEM - Check systemd state on ganeti1025 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:18] PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:18] PROBLEM - Check systemd state on mw1337 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:24] PROBLEM - Check systemd state on mw1306 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:26] PROBLEM - Check systemd state on mc2027 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:26] PROBLEM - Check systemd state on mw1347 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:28] PROBLEM - Check systemd state on logstash1011 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:28] PROBLEM - Check systemd state on mw1435 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:30] PROBLEM - Check systemd state on db2073 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:30] PROBLEM - Check systemd state on db1135 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:34] PROBLEM - Check systemd state on mw2260 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:34] PROBLEM - Check systemd state on dbstore1005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:34] PROBLEM - Check systemd state on mw2302 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:34] PROBLEM - Check systemd state on db2100 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:38] PROBLEM - Check systemd state on cp5016 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:38] PROBLEM - Check systemd state on kafka-logging1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:42] PROBLEM - Check systemd state on mc1037 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:44] SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (RobH) Sorry about that, the SRX just shipped today! So I'll have Jin work on this when he goes out to install the SRX for mr1-eqsin replacement sometime next week? [17:04:46] PROBLEM - Check systemd state on backup1003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:50] PROBLEM - Check systemd state on mw2380 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:50] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:04:51] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (Damiendf) Arg sorry, this is the wrong email address. I corrected it in the initial request. 
[17:04:52] PROBLEM - Check systemd state on mw2263 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:52] PROBLEM - Check systemd state on mc2024 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:52] PROBLEM - Check systemd state on mw2321 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:54] PROBLEM - Check systemd state on db1130 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:54] PROBLEM - Check systemd state on mw1302 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:56] PROBLEM - Check systemd state on mw2400 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:04] PROBLEM - Check systemd state on db2145 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:04] PROBLEM - Check systemd state on mc2037 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:06] PROBLEM - Check systemd state on mw1304 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:06] PROBLEM - Check systemd state on mw2276 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:08] PROBLEM - Check 
systemd state on mw1383 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:10] PROBLEM - Check systemd state on mw1418 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:14] PROBLEM - Check systemd state on restbase2014 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:16] PROBLEM - Check systemd state on wtp1041 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:20] PROBLEM - Check systemd state on mw1376 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:20] PROBLEM - Check systemd state on mw2304 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:22] PROBLEM - Check systemd state on mw2359 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:22] PROBLEM - Check systemd state on mw2376 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:30] PROBLEM - Check systemd state on restbase2021 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:32] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:34] PROBLEM - Check systemd state 
on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:34] PROBLEM - Check systemd state on ganeti2026 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:36] PROBLEM - Check systemd state on wtp1025 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:38] PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:38] PROBLEM - Check systemd state on mc2025 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:40] PROBLEM - Check systemd state on ms-backup2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:42] PROBLEM - Check systemd state on mw1438 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:42] PROBLEM - Check systemd state on dns5001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:46] PROBLEM - Check systemd state on ganeti-test2002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:46] PROBLEM - Check systemd state on mw2305 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:50] PROBLEM - Check systemd 
state on parse2016 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:56] PROBLEM - Check systemd state on mw1433 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:56] PROBLEM - Check systemd state on wtp1047 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:58] PROBLEM - Check systemd state on mw1324 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:00] PROBLEM - Check systemd state on cp4030 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:02] PROBLEM - Check systemd state on mw1362 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:02] PROBLEM - Check systemd state on mw2284 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:02] PROBLEM - Check systemd state on mw1409 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:06] PROBLEM - Check systemd state on cp3053 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:06] PROBLEM - Check systemd state on mw2384 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:08] PROBLEM - Check systemd state on parse2009 is 
CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:12] PROBLEM - Check systemd state on an-conf1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:16] PROBLEM - Check systemd state on cp5005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:16] PROBLEM - Check systemd state on wtp1037 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:16] PROBLEM - Check systemd state on backup2004 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:17] (PS1) Aqu: Set default Airflow concurrency limits [puppet] - https://gerrit.wikimedia.org/r/767220 (https://phabricator.wikimedia.org/T300870) [17:06:18] PROBLEM - Check systemd state on cp6007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:26] PROBLEM - Check systemd state on wtp1045 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:30] PROBLEM - Check systemd state on clouddb1016 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:30] PROBLEM - Check systemd state on analytics1072 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:34] PROBLEM - Check systemd state on mc-gp2002 is CRITICAL: CRITICAL - degraded: The 
following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:34] PROBLEM - Check systemd state on mw1336 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:34] PROBLEM - Check systemd state on mw1367 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:36] PROBLEM - Check systemd state on cp4023 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:36] PROBLEM - Check systemd state on mc2033 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:40] PROBLEM - Check systemd state on mw2298 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:44] PROBLEM - Check systemd state on mw2379 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:46] PROBLEM - Check systemd state on an-worker1124 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:50] PROBLEM - Check systemd state on mw1396 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:50] PROBLEM - Check systemd state on analytics1070 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:52] PROBLEM - Check systemd state on an-worker1098 is CRITICAL: CRITICAL - degraded: The 
following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:56] PROBLEM - Check systemd state on backup2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:00] PROBLEM - Check systemd state on mw2383 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:02] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:06] PROBLEM - Check systemd state on analytics1062 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:06] RECOVERY - Check systemd state on bast1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:14] PROBLEM - Check systemd state on wtp1034 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:14] PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:14] PROBLEM - Check systemd state on cp2036 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:14] PROBLEM - Check systemd state on lvs4007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:20] PROBLEM - Check systemd state on dns2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:20] PROBLEM - Check systemd state on mw1341 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:20] PROBLEM - Check systemd state on mw1334 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:24] PROBLEM - Check systemd state on an-worker1087 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:26] PROBLEM - Check systemd state on an-worker1107 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:30] PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:34] PROBLEM - Check systemd state on an-worker1101 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:34] PROBLEM - Check systemd state on mw2385 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:36] PROBLEM - Check systemd state on db1173 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:38] PROBLEM - Check systemd state on kafka-logging1003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:40] PROBLEM - Check systemd state on cp1082 is CRITICAL: CRITICAL - degraded: The following 
units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:42] PROBLEM - Check systemd state on cp5010 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:42] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:07:46] PROBLEM - Check systemd state on mw2322 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:48] PROBLEM - Check systemd state on db1134 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:50] PROBLEM - Check systemd state on mw1317 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:56] PROBLEM - Check systemd state on db2106 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:56] PROBLEM - Check systemd state on db2151 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:00] PROBLEM - Check systemd state on kafka-jumbo1009 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:02] PROBLEM - Check systemd state on kafka-main2004 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:04] PROBLEM - Check systemd state on mc1048 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:06] PROBLEM - Check systemd state on mw1372 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:07] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:10] PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:10] PROBLEM - Check systemd state on mw1397 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:12] PROBLEM - Check systemd state on mw1452 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:12] PROBLEM - Check systemd state on mw2318 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:12] PROBLEM - Check systemd state on mw2299 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:15] (PS1) Klausman: Add insetup role for ML staging etcd VMs [puppet] - https://gerrit.wikimedia.org/r/767221 (https://phabricator.wikimedia.org/T302503) [17:08:20] PROBLEM - Check systemd state on puppetmaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:20] PROBLEM - Check systemd state on mw1456 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:24] PROBLEM - Check systemd state on db1102 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:26] PROBLEM - Check systemd state on db1166 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:26] PROBLEM - Check systemd state on mw2257 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:34] PROBLEM - Check systemd state on db2130 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:34] PROBLEM - Check systemd state on mw1398 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:36] PROBLEM - Check systemd state on dumpsdata1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:36] PROBLEM - Check systemd state on mw1365 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:38] PROBLEM - Check systemd state on mw1401 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:40] PROBLEM - Check systemd state on mw1441 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:40] PROBLEM - Check systemd state on analytics1063 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:42] PROBLEM - Check systemd state on conf2006 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:44] PROBLEM - Check systemd state on restbase1028 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:52] PROBLEM - Check systemd state on mw1387 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:52] PROBLEM - Check systemd state on wtp1027 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:58] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:58] PROBLEM - Check systemd state on parse2007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:58] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:00] PROBLEM - Check systemd state on mw2300 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:02] PROBLEM - Check systemd state on cp3062 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:04] PROBLEM - Check systemd state on snapshot1010 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:08] PROBLEM - Check systemd state on mw2390 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:08] PROBLEM - Check systemd state on mw2397 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:12] PROBLEM - Check systemd state on db1179 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:16] PROBLEM - Check systemd state on ganeti2029 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:16] PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:20] PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:22] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:24] PROBLEM - Check systemd state on ganeti-test2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:24] PROBLEM - Check systemd state on db2136 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:26] PROBLEM - Check systemd state on cp6011 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:27] PROBLEM - Check systemd state on clouddb1020 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:28] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:09:30] PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:30] PROBLEM - Check systemd state on mw1325 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:32] PROBLEM - Check systemd state on db2139 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:38] PROBLEM - Check systemd state on mw2309 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:40] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:40] PROBLEM - Check systemd state on mw2404 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:42] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:44] PROBLEM - Check systemd state on mw1348 is CRITICAL: CRITICAL - degraded: The 
following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:46] PROBLEM - Check systemd state on cp4036 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:46] PROBLEM - Check systemd state on mw1356 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:46] PROBLEM - Check systemd state on cp3056 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:52] PROBLEM - Check systemd state on mw2267 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:54] PROBLEM - Check systemd state on mw2333 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:54] PROBLEM - Check systemd state on mw2326 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:57] PROBLEM - Check systemd state on mw2399 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:02] PROBLEM - Check systemd state on lvs1020 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:02] PROBLEM - Check systemd state on cloudgw1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:04] PROBLEM - Check systemd state on restbase2013 is CRITICAL: CRITICAL - degraded: The following units 
failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:05] (CR) Hashar: [C: -1] "I have made some css adjustements. I have to test this against our current Gerrit 3.3 and the planned 3.4. If all is good I will check wit" [puppet] - https://gerrit.wikimedia.org/r/756685 (owner: Hashar) [17:10:06] PROBLEM - Check systemd state on logstash1029 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:08] PROBLEM - Check systemd state on cp4034 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:10:10] PROBLEM - Check systemd state on cp1083 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:10] PROBLEM - Check systemd state on db1148 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:10] PROBLEM - Check systemd state on wdqs2006 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:12] PROBLEM - Check systemd state on cp3063 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:12] PROBLEM - Check systemd state on cp3061 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:18] PROBLEM - Check
systemd state on cp4022 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:24] PROBLEM - Check systemd state on ganeti2015 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:24] PROBLEM - Check systemd state on mw2337 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:24] PROBLEM - Check systemd state on mw2355 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:26] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:26] PROBLEM - Check systemd state on mw2332 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:26] PROBLEM - Check systemd state on an-worker1104 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:27] PROBLEM - Check systemd state on db1142 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:28] PROBLEM - Check systemd state on mw2289 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:32] PROBLEM - Check systemd state on cp1088 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:34] PROBLEM - Check systemd state 
on backup2005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:34] PROBLEM - Check systemd state on cp1085 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:34] PROBLEM - Check systemd state on cp1086 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:34] PROBLEM - Check systemd state on lvs6002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:38] PROBLEM - Check systemd state on cp6008 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:46] PROBLEM - Check systemd state on db2147 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:50] PROBLEM - Check systemd state on dbproxy1019 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:50] PROBLEM - Check systemd state on mc1041 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:52] PROBLEM - Check systemd state on mw1327 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:54] PROBLEM - Check systemd state on wdqs2008 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:56] PROBLEM - Check systemd state on mw1318 is 
CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:00] PROBLEM - Check systemd state on ganeti1023 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:04] PROBLEM - Check systemd state on kafka-main1004 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:04] PROBLEM - Check systemd state on cp3050 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:06] PROBLEM - Check systemd state on mw2324 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:08] PROBLEM - Check systemd state on db1141 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:08] PROBLEM - Check systemd state on cp2040 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:10] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:10] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:12] PROBLEM - Check systemd state on cp3052 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:16] PROBLEM - Check systemd state on stat1005 is 
CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:16] PROBLEM - Check systemd state on restbase2026 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:18] PROBLEM - Check systemd state on cp1080 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:22] PROBLEM - Check systemd state on db2140 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:22] PROBLEM - Check systemd state on mw2317 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:22] PROBLEM - Check systemd state on mw1340 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:26] PROBLEM - Check systemd state on mw1443 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:28] PROBLEM - Check systemd state on mw1428 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:30] PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:32] PROBLEM - Check systemd state on logstash1027 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:34] PROBLEM - Check systemd state on maps2008 is 
CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:34] PROBLEM - Check systemd state on cp2038 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:34] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:34] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:34] PROBLEM - Check systemd state on mc-gp1003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:36] PROBLEM - Check systemd state on mw2361 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:38] PROBLEM - Check systemd state on mw1305 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:42] PROBLEM - Check systemd state on restbase2018 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:42] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:42] PROBLEM - Check systemd state on mw2402 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:46] PROBLEM - Check systemd state on mw1354 is CRITICAL: 
CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:50] PROBLEM - Check systemd state on mw2356 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:50] PROBLEM - Check systemd state on ganeti1028 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:52] PROBLEM - Check systemd state on cp3051 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:56] PROBLEM - Check systemd state on mw2365 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:58] PROBLEM - Check systemd state on mw1390 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:00] PROBLEM - Check systemd state on an-worker1123 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:04] PROBLEM - Check systemd state on cp5011 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:08] PROBLEM - Check systemd state on dbprov1001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:08] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:10] PROBLEM - Check systemd state on an-worker1080 is CRITICAL: 
CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:10] PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:14] PROBLEM - Check systemd state on wcqs2002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:14] PROBLEM - Check systemd state on cp5008 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:14] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:16] PROBLEM - Check systemd state on mc-gp1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:18] PROBLEM - Check systemd state on an-coord1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:22] PROBLEM - Check systemd state on ganeti2030 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:24] PROBLEM - Check systemd state on ganeti1024 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:24] PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:26] PROBLEM - Check systemd state on 
logstash1028 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:28] PROBLEM - Check systemd state on authdns2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:30] PROBLEM - Check systemd state on logstash2028 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:34] PROBLEM - Check systemd state on mw1307 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:40] PROBLEM - Check systemd state on mw1368 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:40] PROBLEM - Check systemd state on mc1039 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:40] PROBLEM - Check systemd state on db2098 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:42] PROBLEM - Check systemd state on mw2295 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:46] PROBLEM - Check systemd state on mw1303 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:51] (03PS1) 10Herron: ipmi::monitor: ensure /var/cache/ipmiseld directory exists [puppet] - 10https://gerrit.wikimedia.org/r/767223 (https://phabricator.wikimedia.org/T302639) [17:12:52] PROBLEM - Check systemd state on ganeti1006 is CRITICAL: 
CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:54] PROBLEM - Check systemd state on backup2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:56] PROBLEM - Check systemd state on analytics1061 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:56] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:56] PROBLEM - Check systemd state on backup2002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:58] PROBLEM - Check systemd state on cp1084 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:58] PROBLEM - Check systemd state on mc1051 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:58] PROBLEM - Check systemd state on ganeti1010 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:04] PROBLEM - Check systemd state on mw1314 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:04] PROBLEM - Check systemd state on db1132 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:10] PROBLEM - Check systemd state on an-tool1010 is CRITICAL: 
CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:10] PROBLEM - Check systemd state on mw2285 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:12] PROBLEM - Check systemd state on mw1359 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:12] PROBLEM - Check systemd state on mw1358 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:16] PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:16] PROBLEM - Check systemd state on analytics1071 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:24] PROBLEM - Check systemd state on mw1319 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:24] PROBLEM - Check systemd state on mw1323 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:24] PROBLEM - Check systemd state on cp1090 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:28] PROBLEM - Check systemd state on mw1370 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:30] PROBLEM - Check systemd state on dbprov2003 is CRITICAL: CRITICAL - 
degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:30] PROBLEM - Check systemd state on kafka-jumbo1005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:32] PROBLEM - Check systemd state on mw2363 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:36] PROBLEM - Check systemd state on ganeti3001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:42] PROBLEM - Check systemd state on mw1363 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:42] PROBLEM - Check systemd state on an-worker1084 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:44] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:46] PROBLEM - Check systemd state on analytics1073 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:46] PROBLEM - Check systemd state on mw1425 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:46] PROBLEM - Check systemd state on mw1439 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:50] PROBLEM - Check systemd state on mc2031 is CRITICAL: CRITICAL - 
degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:52] PROBLEM - Check systemd state on mw1392 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:54] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:56] PROBLEM - Check systemd state on mw2319 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:58] PROBLEM - Check systemd state on mw2358 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:58] PROBLEM - Check systemd state on parse2010 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:04] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:06] PROBLEM - Check systemd state on mc2035 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:07] PROBLEM - Check systemd state on restbase2025 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:08] PROBLEM - Check systemd state on cp5013 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:14] PROBLEM - Check systemd state on krb2001 is CRITICAL: CRITICAL - degraded: 
The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:14] PROBLEM - Check systemd state on kafka-logging2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:14] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:16] PROBLEM - Check systemd state on mc-gp2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:16] PROBLEM - Check systemd state on mw2281 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:22] PROBLEM - Check systemd state on ganeti1005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:22] PROBLEM - Check systemd state on mw2258 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:22] PROBLEM - Check systemd state on parse2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:24] PROBLEM - Check systemd state on dns3002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:24] PROBLEM - Check systemd state on mw2288 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:24] PROBLEM - Check systemd state on aqs1015 is CRITICAL: CRITICAL - degraded: 
The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:24] PROBLEM - Check systemd state on ganeti2019 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:28] PROBLEM - Check systemd state on mw2357 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:32] PROBLEM - Check systemd state on mw2378 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:32] PROBLEM - Check systemd state on mw1316 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:34] PROBLEM - Check systemd state on mw1423 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:34] PROBLEM - Check systemd state on mw1349 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [17:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:36] PROBLEM - Check systemd state on mw1442 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [17:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:40] PROBLEM - Check systemd state on mw2296 
is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21621 and previous config saved to /var/cache/conftool/dbconfig/20220301-171441-ladsgroup.json [17:14:42] PROBLEM - Check systemd state on mw2308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:44] PROBLEM - Check systemd state on cp3064 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:44] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [17:14:46] PROBLEM - Check systemd state on an-worker1082 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:50] PROBLEM - Check systemd state on mw1385 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:50] PROBLEM - Check systemd state on snapshot1009 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:52] PROBLEM - Check systemd state on wdqs1013 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:56] PROBLEM - Check systemd state on db2097 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[17:15:00] PROBLEM - Check systemd state on an-worker1140 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:06] PROBLEM - Check systemd state on cp5014 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:07] (03CR) 10Volans: "Structure LGTM, some missing bits inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [17:15:14] PROBLEM - Check systemd state on an-worker1109 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:20] PROBLEM - Check systemd state on mw2386 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:20] PROBLEM - Check systemd state on an-conf1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:22] PROBLEM - Check systemd state on lvs3007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:24] PROBLEM - Check systemd state on mw1311 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:26] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:28] PROBLEM - Check systemd state on mw1416 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[17:15:30] PROBLEM - Check systemd state on aqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:32] PROBLEM - Check systemd state on mw2255 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:32] PROBLEM - Check systemd state on mw1403 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:36] PROBLEM - Check systemd state on mw1426 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:36] PROBLEM - Check systemd state on backup1004 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:40] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:42] PROBLEM - Check systemd state on dns3001 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:44] PROBLEM - Check systemd state on mw1326 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:44] PROBLEM - Check systemd state on an-worker1079 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:44] PROBLEM - Check systemd state on an-worker1115 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[17:15:46] PROBLEM - Check systemd state on backup2006 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:46] PROBLEM - Check systemd state on mc1052 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:47] PROBLEM - Check systemd state on clouddb1013 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:54] PROBLEM - Check systemd state on mw1309 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:56] PROBLEM - Check systemd state on db1157 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:56] PROBLEM - Check systemd state on mw1373 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:56] PROBLEM - Check systemd state on mw1364 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:00] PROBLEM - Check systemd state on cp4025 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:00] PROBLEM - Check systemd state on ganeti2010 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:02] PROBLEM - Check systemd state on cp6010 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:04] 
PROBLEM - Check systemd state on db2141 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:04] PROBLEM - Check systemd state on mw2354 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:04] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:06] PROBLEM - Check systemd state on an-druid1003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:08] PROBLEM - Check systemd state on lvs4006 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:08] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:10] PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:14] PROBLEM - Check systemd state on mw2310 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:16] PROBLEM - Check systemd state on logstash1010 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:18] PROBLEM - Check systemd state on cp3055 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:22] 
PROBLEM - Check systemd state on cp6005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:24] PROBLEM - Check systemd state on mw2373 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:24] PROBLEM - Check systemd state on cp4028 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:26] PROBLEM - Check systemd state on cp6003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:27] PROBLEM - Check systemd state on mw1375 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:30] PROBLEM - Check systemd state on puppetmaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:32] PROBLEM - Check systemd state on restbase2024 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:34] PROBLEM - Check systemd state on an-worker1122 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:34] PROBLEM - Check systemd state on mw2265 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:36] PROBLEM - Check systemd state on mw1384 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:36] 
PROBLEM - Check systemd state on mw1448 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:38] PROBLEM - Check systemd state on ganeti5002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:38] PROBLEM - Check systemd state on mw1446 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:40] PROBLEM - Check systemd state on ganeti4002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:46] PROBLEM - Check systemd state on mw2277 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:50] PROBLEM - Check systemd state on mw1404 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:52] PROBLEM - Check systemd state on mw1432 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:54] PROBLEM - Check systemd state on restbase1025 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:56] PROBLEM - Check systemd state on ms-backup1002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:56] PROBLEM - Check systemd state on cp6002 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:00] PROBLEM 
- Check systemd state on mw2336 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21622 and previous config saved to /var/cache/conftool/dbconfig/20220301-171701-ladsgroup.json [17:17:02] PROBLEM - Check systemd state on mw2329 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:03] I just started a schema change, that can't be possibly it [17:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:04] PROBLEM - Check systemd state on mw2388 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:04] PROBLEM - Check systemd state on lvs2010 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:04] PROBLEM - Check systemd state on mw1312 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:06] PROBLEM - Check systemd state on mw2287 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:06] PROBLEM - Check systemd state on druid1007 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:06] PROBLEM - Check systemd state on lvs3006 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:07] PROBLEM - Check 
systemd state on cp2034 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:10] PROBLEM - Check systemd state on an-druid1005 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:38] !log stopped ircecho on alert1001 due to systemd unit alert shower [17:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:46] dancy: hey, I'm free early as it happens, drop me a link any time :) [17:18:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/765624 plz [17:20:19] (03CR) 10RLazarus: [C: 03+2] scap.cfg.erb: Add container image build settings [puppet] - 10https://gerrit.wikimedia.org/r/765624 (https://phabricator.wikimedia.org/T297673) (owner: 10Ahmon Dancy) [17:20:19] herron: FYI it will be restarted by next puppet run [17:20:48] volans: puppet is disabled for now [17:20:55] ack [17:20:56] rzl: Gracias [17:21:09] dancy: merged -- want a manual puppet run on mwdebug or anything? [17:21:21] on deploymet.eqiad.wmnet [17:21:22] (03CR) 10Elukey: [C: 03+1] Add insetup role for ML staging etcd VMs [puppet] - 10https://gerrit.wikimedia.org/r/767221 (https://phabricator.wikimedia.org/T302503) (owner: 10Klausman) [17:23:44] dancy: done on deploy1002, assume that was the right place :) [17:23:53] yeah.. perfect.. thank you! [17:24:03] !log dancy@deploy1002 Finished scap: testing container image build (duration: 28m 39s) [17:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:30] sure thing! 
let me know if there's anything else [17:25:46] (03CR) 10Klausman: [C: 03+2] Add insetup role for ML staging etcd VMs [puppet] - 10https://gerrit.wikimedia.org/r/767221 (https://phabricator.wikimedia.org/T302503) (owner: 10Klausman) [17:30:23] (03PS1) 10JMeybohm: admin: add tmlt-tmager to krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/767226 (https://phabricator.wikimedia.org/T301679) [17:30:25] (03PS1) 10JMeybohm: admin: add damiendf to krb & analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/767227 (https://phabricator.wikimedia.org/T301659) [17:31:59] (03CR) 10Cwhite: [C: 03+1] ipmi::monitor: ensure /var/cache/ipmiseld directory exists [puppet] - 10https://gerrit.wikimedia.org/r/767223 (https://phabricator.wikimedia.org/T302639) (owner: 10Herron) [17:32:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21624 and previous config saved to /var/cache/conftool/dbconfig/20220301-173206-ladsgroup.json [17:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:17] (03CR) 10CDanis: [C: 03+1] ipmi::monitor: ensure /var/cache/ipmiseld directory exists [puppet] - 10https://gerrit.wikimedia.org/r/767223 (https://phabricator.wikimedia.org/T302639) (owner: 10Herron) [17:32:54] (03CR) 10Herron: [C: 03+2] ipmi::monitor: ensure /var/cache/ipmiseld directory exists [puppet] - 10https://gerrit.wikimedia.org/r/767223 (https://phabricator.wikimedia.org/T302639) (owner: 10Herron) [17:34:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21625 and previous config saved to /var/cache/conftool/dbconfig/20220301-174711-ladsgroup.json [17:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:24] !log upgrade grafana in eqiad T282863 [17:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:26] T282863: Upgrade Grafana to 8.x - https://phabricator.wikimedia.org/T282863 [17:48:45] cwhite: should we (wmcs) be taking any action on the grafana-labs.wm.o grafana instances? [17:50:33] !log re-enabling puppet and ircecho on alert1001 [17:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:06] !log completed grafana upgrade in eqiad T282863 [17:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:48] taavi: I'm not sure about that instance. Probably up to wmcs folks about what to do. [17:55:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:57:45] cwhite: ok, in that case my question basically turns to "how do we upgrade it?" I don't see any magic hiera switches and buster-wikimedia/thirdparty/grafana only has 7.5 packages (and I don't see a 8.x component) [17:58:20] (03CR) 10Ottomata: "This will change these settings for all our wmf airflow instances. This is the right place to change them if that is the intention." [puppet] - 10https://gerrit.wikimedia.org/r/767220 (https://phabricator.wikimedia.org/T300870) (owner: 10Aqu) [17:59:59] taavi: packages to be uploaded shortly. 
Install is pretty straightforward, I'll add the procedure I used on the task [18:02:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300992)', diff saved to https://phabricator.wikimedia.org/P21626 and previous config saved to /var/cache/conftool/dbconfig/20220301-180216-ladsgroup.json [18:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:22] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [18:05:24] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Accidentally unsubscribed everyone from open-glam mailing list - https://phabricator.wikimedia.org/T302816 (10Aklapper) @Scann: Feel free to report to upstream at https://gitlab.com/mailman/mailman/-/issues [18:08:43] PROBLEM - Check systemd state on cp6011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_intel_microcode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:49] PROBLEM - Disk space on cp6010 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=98%): /tmp 0 MB (0% inode=98%): /var/tmp 0 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp6010&var-datasource=drmrs+prometheus/ops [18:10:51] taavi: packages are up and instructions are on the task: T282863 [18:10:53] T282863: Upgrade Grafana to 8.x - https://phabricator.wikimedia.org/T282863 [18:10:55] PROBLEM - Check systemd state on cp6010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:10:58] thank you!
[18:12:53] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:24:13] RECOVERY - Check systemd state on cp6011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:33] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:25:11] PROBLEM - traffic_server tls process restarted on cp6015 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6015&var-layer=tls [18:28:11] (03CR) 10Cwhite: [C: 03+1] "Building CI checks will probably be more difficult using this approach, but there's still room to catch unintentional module imports elsew" [puppet] - 10https://gerrit.wikimedia.org/r/766814 (https://phabricator.wikimedia.org/T292175) (owner: 10Herron) [18:31:39] (03CR) 10Cwhite: [C: 03+1] alertmanager: open per-device librenms tasks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767179 (https://phabricator.wikimedia.org/T300836) (owner: 10Filippo Giunchedi) [18:37:46] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/765562 (owner: 10Muehlenhoff) [18:40:09] PROBLEM - Number of messages locally queued by purged for processing on cp6014 is CRITICAL: cluster=cache_text instance=cp6014 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [18:41:05] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Accidentally unsubscribed everyone from open-glam mailing list - https://phabricator.wikimedia.org/T302816 (10Scann) Done! https://gitlab.com/mailman/mailman/-/issues/983 [18:44:23] PROBLEM - Disk space on cp6011 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=98%): /tmp 0 MB (0% inode=98%): /var/tmp 0 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp6011&var-datasource=drmrs+prometheus/ops [18:45:02] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2008.codfw.wmnet with reason: Remove from Ganeti cluster for decom [18:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2008.codfw.wmnet with reason: Remove from Ganeti cluster for decom [18:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:53] PROBLEM - Check systemd state on cp6011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:47:54] (03PS2) 10Muehlenhoff: Remove ganeti2008 [puppet] - 10https://gerrit.wikimedia.org/r/767182 (https://phabricator.wikimedia.org/T302078) [18:50:57] (03CR) 10Muehlenhoff: [C: 03+2] Remove ganeti2008 [puppet] - 10https://gerrit.wikimedia.org/r/767182 (https://phabricator.wikimedia.org/T302078) (owner: 10Muehlenhoff) [18:51:03] RECOVERY - Number of messages locally queued by purged for processing on cp6014 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [18:54:10] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2008.codfw.wmnet [18:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:08] PROBLEM - Number of messages locally queued by purged for processing on cp6013 is CRITICAL: cluster=cache_text instance=cp6013 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [18:57:31] !log 1.38.0-wmf.24 train (T300200): there's currently a single blocker at T302643; staging to testwikis and holding there until backport's available [18:57:31] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/34020/puppetmaster1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/766851 (owner: 10Dzahn) [18:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:35] T300200: 1.38.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T300200 [18:57:35] T302643: Beta Meta-Wiki throws an error on Special:Preferences: DomainException: HTMLForm::getField: no field named globalwatchlist-prefs - https://phabricator.wikimedia.org/T302643 [18:58:10] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [18:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:20] RECOVERY - Number of messages locally queued by purged for processing on cp6013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [19:00:04] brennen and dduvall: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220301T1900). [19:01:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:46] PROBLEM - SSH on cp6011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:03:59] o/ - currently working through staging [19:04:22] Don't be afraid. [19:04:25] That's great. [19:06:24] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:07:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2008.codfw.wmnet [19:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:57] 10ops-codfw, 10decommission-hardware: decommission ganeti2008 - https://phabricator.wikimedia.org/T302578 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Papaul [19:08:56] PROBLEM - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp6011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:10:57] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767230 [19:10:59] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767230 (owner: 10Brennen Bearnes) [19:11:00] brennen: looking at that bug, if you're going to testwikis, why not go all group 0? 
It only affects testwiki + metawiki [19:11:08] https://phabricator.wikimedia.org/T302643#7739375 [19:11:28] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:11:40] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767230 (owner: 10Brennen Bearnes) [19:11:45] !log brennen@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.24 refs T300200 [19:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:48] T300200: 1.38.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T300200 [19:12:20] PROBLEM - Disk space on cp6013 is CRITICAL: DISK CRITICAL - free space: / 10036 MB (2% inode=98%): /tmp 10036 MB (2% inode=98%): /var/tmp 10036 MB (2% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp6013&var-datasource=drmrs+prometheus/ops [19:13:50] PROBLEM - Number of messages locally queued by purged for processing on cp6015 is CRITICAL: cluster=cache_text instance=cp6015 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6015 [19:14:00] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp6011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:16:04] RECOVERY - Number of messages locally queued by purged for processing on cp6015 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6015 [19:21:02] PROBLEM - traffic_server backend process restarted on cp6010 is CRITICAL: 8 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6010&var-layer=backend [19:22:38] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6011 is OK: HTTP OK: HTTP/1.0 200 OK - 25376 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:24:04] RhinosF1: hmm. metawiki isn't, however, in testwikis, right? [19:24:45] (03CR) 10Dzahn: [C: 03+2] wikistats: move repo from operations/debs on Gerrit to cloud on Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/766852 (owner: 10Dzahn) [19:24:51] (03PS3) 10Dzahn: wikistats: move repo from operations/debs on Gerrit to cloud on Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/766852 [19:24:59] brennen: That is correct. [19:25:01] brennen: meta is group1 though [19:25:58] ah, i see what you're saying. yeah, i'd had it in my head that it's group0. so yeah, going ahead probably makes sense if it legit only affects those two. [19:29:03] dancy: on sync-apaches and counter seems to be updating cleanly. [19:29:16] excellent. [19:29:22] what does in-flight max out at? 
[19:30:38] PROBLEM - traffic_server tls process restarted on cp6016 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6016&var-layer=tls [19:31:57] brennen: my limited understanding says it should only [19:32:36] PROBLEM - Ensure local MW versions match expected deployment on mw1414 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:33:16] PROBLEM - Ensure local MW versions match expected deployment on mw1448 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:33:52] PROBLEM - Ensure local MW versions match expected deployment on mw1415 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:34:06] PROBLEM - Ensure local MW versions match expected deployment on mw1417 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:34:08] ^ unusual, normally a sign of something going wrong during deployment or that hosts are down but in scap [19:34:23] (03PS1) 10Krinkle: Revert "preferences: Use a faster and simpler form descriptor when validating" [core] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767089 (https://phabricator.wikimedia.org/T302643) [19:34:35] dancy: 80 [19:34:42] nod.. thx.
[19:34:50] mutante: hrm [19:35:11] Those hosts were probably checked mid sync [19:35:16] RECOVERY - SSH on cp6011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:35:24] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:35:36] PROBLEM - Ensure local MW versions match expected deployment on mw1418 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:35:42] PROBLEM - Ensure local MW versions match expected deployment on mw1416 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:35:44] oh, but growing. [19:35:48] PROBLEM - Ensure local MW versions match expected deployment on mw1447 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:36:00] still only 3 mismatched? [19:36:01] one of them.. maybe.. but so many ... [19:36:07] !log mw1414 - scap pull [19:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:12] PROBLEM - Ensure local MW versions match expected deployment on mw1420 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:36:19] What exactly is the check.. checking? 
[19:36:21] PROBLEM - LVS text drmrs port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.drmrs.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:36:38] here, looks like just drmrs so no urgent action needed but checking [19:36:38] PROBLEM - Ensure local MW versions match expected deployment on mw1306 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:36:40] PROBLEM - Ensure local MW versions match expected deployment on mw1449 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:36:58] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp6013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [19:37:02] Does anyone else pronounce that Doctor Mrs. ? [19:37:04] rzl: there is something wrong with deployment too but I agree the page seems to be drms and unrelated [19:37:08] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6013 is OK: HTTP OK: HTTP/1.0 200 OK - 25347 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:37:16] PROBLEM - Ensure local MW versions match expected deployment on mw1313 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:37:17] dancy: "doctor mrs." 
is my second favorite behind "dreamers" [19:37:24] dancy: that's what I said the first time but we call it "dreamers" [19:37:30] PROBLEM - traffic_server backend process restarted on cp6011 is CRITICAL: 291 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6011&var-layer=backend [19:37:32] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:37:36] PROBLEM - Ensure local MW versions match expected deployment on mw2254 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:37:55] wiki.willy has been quite fond of "Doctor Missus" during our weekly catch ups :) [19:38:05] first I'm hearing "dreamers" think I prefer that. [19:38:07] RECOVERY - LVS text drmrs port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.drmrs.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 610 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:38:11] anybody think this is a ctrl-c scap situation? 
[19:38:20] no [19:38:26] !log mw1449 - scap pull [19:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:40] PROBLEM - Ensure local MW versions match expected deployment on mw2380 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:38:53] drmrs page doesn't seem to be network related this time [19:39:14] PROBLEM - Ensure local MW versions match expected deployment on mw2386 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:39:26] PROBLEM - Ensure local MW versions match expected deployment on mw1425 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:39:32] PROBLEM - Ensure local MW versions match expected deployment on mw1307 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:39:38] PROBLEM - Ensure local MW versions match expected deployment on mw2374 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:39:42] PROBLEM - Ensure local MW versions match expected deployment on parse2001 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:39:48] PROBLEM - Ensure local MW versions match expected deployment on parse2016 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:39:50] PROBLEM - Ensure local MW versions match expected deployment on mw2300 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:39:58] PROBLEM - Ensure local MW versions match expected deployment on mw1385 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:39:58] PROBLEM - Ensure local MW versions match expected deployment on mw1309 is CRITICAL: CRITICAL: 3 
mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:40:05] did you deploy from codfw or something? [19:40:09] the check is usr/local/lib/nagios/plugins/check_mw_versions --deployhost deploy1002.eqiad.wmnet [19:40:17] it compares against deploy1002 [19:40:24] PROBLEM - Ensure local MW versions match expected deployment on mw1453 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:40:28] PROBLEM - traffic_server backend process restarted on cp6013 is CRITICAL: 5 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6013&var-layer=backend [19:40:37] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:40:40] PROBLEM - Ensure local MW versions match expected deployment on mw2409 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:40:42] i'm on deploy1002 [19:40:43] PROBLEM - Ensure local MW versions match expected deployment on mw1381 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:40:52] PROBLEM - Ensure local MW versions match expected deployment on mw1319 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:40:52] dancy: brennen: this is NOT fixed by a scap pull [19:40:53] so it's checking to see if hosts are out of sync w/ the deploy server (a normal state during the sync-apaches phase of scap) ? 
[19:40:56] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp6013 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:41:00] PROBLEM - Ensure local MW versions match expected deployment on mw1443 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:41:02] PROBLEM - Ensure local MW versions match expected deployment on mw1369 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:41:02] PROBLEM - Ensure local MW versions match expected deployment on mw1377 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:41:02] PROBLEM - Ensure local MW versions match expected deployment on mw1348 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:41:02] PROBLEM - Ensure local MW versions match expected deployment on wtp1037 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:41:04] Local version for labtestwiki is incorrect (local: php-1.38.0-wmf.24, official: php-1.38.0-wmf.23) [19:41:04] PROBLEM - Ensure local MW versions match expected deployment on mw2255 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:41:07] Local version for testwiki is incorrect (local: php-1.38.0-wmf.24, official: php-1.38.0-wmf.23) [19:41:08] PROBLEM - Ensure local MW versions match expected deployment on mw2411 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:41:10] Local version for testwikidatawiki is incorrect (local: php-1.38.0-wmf.24, official: php-1.38.0-wmf.23) [19:41:13] ^ can you fix the version on test wikis? 
[19:41:21] this is testwiki,labtestwiki and testwikidatawiki [19:41:29] that's the 3 versions that don't match and cause the alerts [19:41:33] PROBLEM - Ensure local MW versions match expected deployment on mw1366 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:41:45] hmm. [19:41:54] I think the check may be out of date with respect to some recent scap changes. [19:41:57] I'll look at it carefully . [19:41:58] PROBLEM - Ensure local MW versions match expected deployment on mw2271 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:41:59] seems like a step was missed that is normally "first deploy to test wikis" [19:42:00] PROBLEM - Ensure local MW versions match expected deployment on mw2264 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:42:00] PROBLEM - Ensure local MW versions match expected deployment on mw2304 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:42:20] PROBLEM - Ensure local MW versions match expected deployment on mw1341 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:42:20] PROBLEM - Ensure local MW versions match expected deployment on mw1304 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:42:22] recent scap changes was my first guess, given the version change [19:42:22] PROBLEM - Ensure local MW versions match expected deployment on mw1391 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:42:50] PROBLEM - Ensure local MW versions match expected deployment on mw1456 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:42:50] can you deploy but tell it "test wikis only" 
[19:42:58] PROBLEM - Ensure local MW versions match expected deployment on mw2383 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:02] PROBLEM - Ensure local MW versions match expected deployment on snapshot1013 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:02] PROBLEM - Ensure local MW versions match expected deployment on mw1371 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:02] PROBLEM - Ensure local MW versions match expected deployment on mw1345 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:08] this is the deploy to testwikis stage, theoretically. [19:43:08] PROBLEM - Ensure local MW versions match expected deployment on mw2370 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:26] PROBLEM - Ensure local MW versions match expected deployment on mw2403 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:28] ok, then the new scap version broke the monitoring [19:43:32] PROBLEM - Ensure local MW versions match expected deployment on labweb1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:32] PROBLEM - Ensure local MW versions match expected deployment on mw1380 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:32] PROBLEM - Ensure local MW versions match expected deployment on mw1337 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:40] PROBLEM - Ensure local MW versions match expected deployment on mw1435 is CRITICAL: CRITICAL: 3 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [19:43:48] PROBLEM - Ensure local MW versions match expected deployment on mw1336 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:48] PROBLEM - Ensure local MW versions match expected deployment on mw1346 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:43:52] PROBLEM - Ensure local MW versions match expected deployment on mw1370 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:44:22] I stopped the bot [19:44:29] thx. [19:45:35] !log alert1001 - disable puppet, systemctl stop ircecho - to stop bot spam, caused somehow by new scap version breaking "mw versions mismwatch" alerting - affects labtestwiki,testwiki,testwikidatawiki [19:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:38] mutante: Are any parameters passed to the script? [19:46:08] dancy: only the name of the deployment server [19:46:11] command[check_mw_wikiversion_difference]=/usr/local/lib/nagios/plugins/check_mw_versions --deployhost deploy1002.eqiad.wmnet [19:46:13] ok [19:46:42] DEPLOYMENTS_PATH = "/mediawiki/mediawiki/wikiversions.json" [19:46:42] LOCAL_VERSIONS_FILE = "/srv/mediawiki/wikiversions.json" [19:47:24] "labtestwiki": "php-1.38.0-wmf.24", [19:47:32] "testwiki": "php-1.38.0-wmf.24", [19:48:12] that DEPLOYMENTS_PATH looks off? [19:48:14] I have never seen a path /mediawiki/ [19:48:17] yeah [19:48:20] sus [19:49:44] that's used to make a HTTP request to deploy1002, not to read a file from the local filesystem [19:50:06] maybe it should use mediawiki-staging instead or something like that? 
[19:50:24] That means the path is relative to /srv/deployment on the deploy server [19:50:28] on test wikis the versions in the local wikiversions.json do not match what you get when asking deploy1002 [19:50:56] hmm `/srv/deployment/mediawiki/mediawiki/` [19:51:13] "labtestwiki": "php-1.38.0-wmf.23", [19:51:18] "testwiki": "php-1.38.0-wmf.23", [19:51:21] 23 vs 24 [19:51:43] and /srv/deployment/mediawiki is a symlink to /srv/mediawiki. That's where the change is. [19:51:46] the deployment server thinks it is on .23 but the appservers see .24 [19:51:49] how? [19:52:22] old scap rsynced to /srv/mediawiki early. new scap does not affect /srv/mediawiki at all unless the deploy server is listed as an install target, in which case it will get updated later. [19:52:38] (sorry for typos) [19:53:17] ok! so the root cause is mixing old and new scap? [19:54:03] No. It's just the changes in scap. That symlink was an unknown/unexpected bit [19:54:59] I think the best course of action is to reenable the alerts after they've cleared, which should be when the sync-world is done.. and I'll find ways to adjust the checker. [19:55:00] are we likely to see fallout besides the monitoring is my main question [19:55:26] dancy: seems reasonable [19:55:46] is sync-world ongoing? [19:56:02] brennen: I don't expect other problems. The change in question was already deployed and used last week. [19:56:12] mutante: yes [19:56:27] ok! I will refresh some of those icinga checks [19:56:29] dancy: ack, thanks. [19:59:29] Sorry for the noise everyone. [19:59:29] The price of progress. :-) [19:59:29] there's always something [19:59:29] nod. [20:01:12] a manual "scap pull" does not fix the issue [20:01:24] can a sync-world be different? [20:01:45] you'd have to scap pull on the deploy server [20:01:54] (pressing enter has reared its head again.) [20:02:05] I have to get to a meeting now.. damnit! [20:02:11] where does the deployment server pull from then?
[20:02:37] (CR) RLazarus: [C: +1] "Sorry for the late review -- looks great, thanks for adding this!" [puppet] - https://gerrit.wikimedia.org/r/766590 (https://phabricator.wikimedia.org/T300195) (owner: AikoChou) [20:02:37] itself [20:03:18] ok, but doing that while sync-world is going on sounds like a potentially bad thing too [20:03:48] It's not a real problem. [20:03:52] trust me. :-) [20:05:03] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.24 refs T300200 (duration: 53m 17s) [20:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:07] T300200: 1.38.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T300200 [20:05:09] !log alert1001 - re-enabled puppet [20:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:20] not a problem unless it doesn't clear now. [20:06:01] mutante, dancy: guessing we'll need to silence that again to move to group0 [20:06:25] unless... oh, maybe the symlink thing is already handled, it'll only happen the once? [20:07:00] (i am wrong.) [20:07:46] there isn't really a way to silence just this alert alone, unless I flood the channel with 350 ACKs [20:08:01] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:08:02] ah, yeah, that's a bit of a problem. [20:08:15] RECOVERY - Disk space on cp6011 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp6011&var-datasource=drmrs+prometheus/ops [20:08:18] and I can't just kill the bot, because then we miss those other alerts up there [20:10:03] but the good news is..
most of the version alerts have cleared [20:10:03] after I rescheduled them [20:10:03] PROBLEM - SSH on cp6013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:11:31] ^ we are looking at this [20:12:08] for the moment, i'm going to do a quick backport of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/767089 then we should figure out how we want to proceed to group0. i'm assuming alerts will reappear. [20:12:28] (CR) Brennen Bearnes: [C: +2] Revert "preferences: Use a faster and simpler form descriptor when validating" [core] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767089 (https://phabricator.wikimedia.org/T302643) (owner: Krinkle) [20:12:31] but to be clear: anything about drmrs (or hosts with 6xxx numbers, which are in drmrs), those alerts have no bearing on users, and are not related to any scap woes [20:12:36] SRE, observability, serviceops: aggregate mismatched wikiversions alert - https://phabricator.wikimedia.org/T302832 (CDanis) [20:12:40] bblack: we should make them not page in the meantime [20:12:43] thanks sukhe, bblack. [20:13:40] cdanis: there's kind of a chicken-and-egg we're in at this stage: we'd rather have them paging, because we want users to be on this in the near future. If it's not paging, we don't know what's unstable there, and we can't build enough comfort in its stability to ever put users there.
[20:13:55] bblack: icinga has logs [20:14:01] paging means we interrupt the flow of ~50 people [20:14:19] PROBLEM - purged service on cp6011 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:14:19] PROBLEM - purged service on cp6012 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:14:29] PROBLEM - purged service on cp6007 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:14:47] PROBLEM - purged service on cp6008 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:14:47] PROBLEM - purged service on cp6006 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:15:03] cdanis: yes but that logic only works if the only potential causes of any instability lie with the people who will bother reading the logs, or something like that. [20:15:03] (Abandoned) Bking: elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking) [20:15:21] the point is, we don't plan to put users there and then turn on alerting or paging, it's the other way around, and we're very close now.
[20:15:23] PROBLEM - purged service on cp6009 is CRITICAL: CRITICAL - Expecting active but unit purged is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:15:23] PROBLEM - purged service on cp6001 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:15:23] PROBLEM - purged service on cp6004 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:15:37] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:15:49] PROBLEM - purged service on cp6005 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:15:59] (03PS1) 10Dzahn: mediawiki: disable version monitoring [puppet] - 10https://gerrit.wikimedia.org/r/767237 [20:16:13] PROBLEM - purged service on cp6002 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:16:13] PROBLEM - purged service on cp6003 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:17:42] brennen: dancy: this is one way https://gerrit.wikimedia.org/r/c/operations/puppet/+/767237/1/modules/profile/manifests/mediawiki/monitor_versions.pp [20:17:45] PROBLEM - Check systemd state on cp6009 is CRITICAL: CRITICAL - degraded: The following units failed: purged.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:05] PROBLEM - purged service on cp6014 is CRITICAL: CRITICAL - Expecting active but unit purged is failed 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:18:11] PROBLEM - Check systemd state on cp6014 is CRITICAL: CRITICAL - degraded: The following units failed: purged.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:30] mutante: lgtm as a temporary step [20:18:31] well. since we have all these cp6 alerts I give up on that [20:18:42] does not seem to matter [20:19:15] fair enough. [20:19:21] ? [20:22:29] RECOVERY - purged service on cp6011 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:22:31] RECOVERY - purged service on cp6012 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:22:35] PROBLEM - Check systemd state on cp6010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-exporter-apt.service,systemd-journald-audit.socket,systemd-journald-dev-log.socket,systemd-journald.service,systemd-journald.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:39] RECOVERY - purged service on cp6007 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:22:57] RECOVERY - purged service on cp6006 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:22:57] RECOVERY - purged service on cp6008 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:23:11] RECOVERY - Check systemd state on cp6009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:23] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp6016 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:23:33] RECOVERY - purged service on cp6009 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:23:33] RECOVERY - purged service on cp6001 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:23:33] RECOVERY - purged service on cp6004 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:23:33] RECOVERY - purged service on cp6014 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:23:41] RECOVERY - Check systemd state on cp6014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:43] bblack: I understand where you're coming from but I think paging is very distinct from alerting, especially in our environment [20:23:59] RECOVERY - purged service on cp6005 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:24:23] RECOVERY - purged service on cp6002 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:24:23] RECOVERY - purged service on cp6003 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:25:37] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:26:01] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp6016 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:27:12] (03Merged) 10jenkins-bot: 
Revert "preferences: Use a faster and simpler form descriptor when validating" [core] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767089 (https://phabricator.wikimedia.org/T302643) (owner: 10Krinkle) [20:30:49] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.24/includes: Backport: [[gerrit:767089|Revert "preferences: Use a faster and simpler form descriptor when validating" (T302643)]] (duration: 00m 55s) [20:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:53] T302643: Beta Meta-Wiki throws an error on Special:Preferences: DomainException: HTMLForm::getField: no field named globalwatchlist-prefs - https://phabricator.wikimedia.org/T302643 [20:33:16] !log 1.38.0-wmf.24 train (T300200): no current blockers; proceeding to group0; note this may briefly trigger some version alerts [20:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:20] T300200: 1.38.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T300200 [20:33:35] PROBLEM - Number of messages locally queued by purged for processing on cp6015 is CRITICAL: cluster=cache_text instance=cp6015 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6015 [20:33:48] (03PS1) 10Brennen Bearnes: group0 wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767240 [20:33:50] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767240 (owner: 10Brennen Bearnes) [20:34:27] PROBLEM - Number of messages locally queued by purged for processing on cp6016 is CRITICAL: cluster=cache_text instance=cp6016 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [20:34:39] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767240 (owner: 10Brennen Bearnes) [20:36:03] PROBLEM - Number of messages locally queued by purged for processing on cp6011 is CRITICAL: cluster=cache_text instance=cp6011 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [20:36:09] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.24 refs T300200 [20:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:13] RECOVERY - Number of messages locally queued by purged for processing on cp6016 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [20:39:03] RECOVERY - Number of messages locally queued by purged for processing on cp6015 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6015 [20:40:13] RECOVERY - Disk space on cp6010 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp6010&var-datasource=drmrs+prometheus/ops [20:41:13] I'm back! [20:41:17] * dancy scrolls [20:41:39] RECOVERY - Number of messages locally queued by purged for processing on cp6011 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [20:56:10] (03PS2) 10Dzahn: mediawiki: disable version monitoring [puppet] - 10https://gerrit.wikimedia.org/r/767237 (https://phabricator.wikimedia.org/T302832) [20:56:15] PROBLEM - Number of messages locally queued by purged for processing on cp6016 is CRITICAL: cluster=cache_text instance=cp6016 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [20:58:51] RECOVERY - traffic_server backend process restarted on cp6011 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6011&var-layer=backend [21:00:04] RoanKattouw and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220301T2100). [21:00:04] No Gerrit patches in the queue for this window AFAICS. [21:00:41] PROBLEM - Number of messages locally queued by purged for processing on cp6014 is CRITICAL: cluster=cache_text instance=cp6014 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [21:00:47] RECOVERY - traffic_server tls process restarted on cp6011 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6011&var-layer=tls [21:01:39] RECOVERY - Number of messages locally queued by purged for processing on cp6016 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [21:06:09] RECOVERY - Number of messages locally queued by purged for processing on cp6014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [21:06:39] RECOVERY - SSH on cp6013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:09:27] PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp6013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [21:10:35] muntante: Are you available for some live hacking on deploy1002's apache config? [21:11:45] PROBLEM - Host cp6013 is DOWN: PING CRITICAL - Packet loss = 100% [21:12:51] RECOVERY - Host cp6013 is UP: PING OK - Packet loss = 0%, RTA = 85.24 ms [21:12:53] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp6013 is OK: OK - Certificate *.wikipedia.org will expire on Thu 17 Nov 2022 11:59:59 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:13:19] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6013 is OK: HTTP OK: HTTP/1.1 200 Ok - 33532 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:13:35] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp6013 is OK: HTTP OK: HTTP/1.0 200 OK - 22709 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:14:15] RECOVERY - traffic_server backend process restarted on cp6013 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6013&var-layer=backend [21:14:27] RECOVERY - Check systemd state on cp6013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:14:53] RECOVERY - ats-tls HTTPS wikiworkshop.org RSA on cp6013 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 287106 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 33 days) https://wikitech.wikimedia.org/wiki/HTTPS [21:14:59] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6013 is OK: HTTP OK: HTTP/1.0 200 OK - 25431 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:16:29] PROBLEM - Confd vcl based reload on cp6015 is CRITICAL: reload-vcl failed to run since 0h, 7 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [21:17:41] RECOVERY - Check systemd state on cp6011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:23] PROBLEM - Ensure local MW versions match expected deployment on mw1440 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:18:23] PROBLEM - Ensure local MW versions match expected deployment on mw1455 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:18:23] PROBLEM - Ensure local MW versions match expected deployment on mw2305 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:18:23] PROBLEM - Ensure local MW versions match expected deployment on mw2314 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:18:23] PROBLEM - Ensure local MW versions match expected deployment on mw2295 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:18:23] PROBLEM - Ensure local MW versions match expected deployment on mw2368 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:18:25] PROBLEM - Ensure local MW versions match expected deployment on mw2363 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:18:37] PROBLEM - Ensure local MW versions match expected deployment on mwdebug2002 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:18:41] PROBLEM - Ensure local MW versions match expected deployment on mw1413 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:18:45] PROBLEM - Ensure local MW versions match expected deployment on 
mw1400 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:18:47] hmm [21:19:17] PROBLEM - Ensure local MW versions match expected deployment on mw1390 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:19:23] PROBLEM - Ensure local MW versions match expected deployment on mw2293 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:19:53] PROBLEM - Ensure local MW versions match expected deployment on mw1412 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:19:55] PROBLEM - Ensure local MW versions match expected deployment on mw2290 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:20:13] PROBLEM - Ensure local MW versions match expected deployment on mw2288 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:20:23] PROBLEM - Ensure local MW versions match expected deployment on mw1340 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:20:23] PROBLEM - Ensure local MW versions match expected deployment on labweb1002 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:20:23] PROBLEM - Ensure local MW versions match expected deployment on labweb1001 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:20:23] PROBLEM - Ensure local MW versions match expected deployment on mw1339 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:20:23] PROBLEM - Ensure local MW versions match expected deployment on mw1335 is CRITICAL: CRITICAL: 128 mismatched wikiversions 
https://wikitech.wikimedia.org/wiki/Application_servers [21:20:23] PROBLEM - Ensure local MW versions match expected deployment on mw1354 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:20:23] PROBLEM - Ensure local MW versions match expected deployment on mw1336 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:20:23] (03PS1) 10Ahmon Dancy: check_mw_versions.py: Fix problem induced by recent scap changes [puppet] - 10https://gerrit.wikimedia.org/r/767242 [21:20:34] well that's no good [21:20:56] (03CR) 10jerkins-bot: [V: 04-1] check_mw_versions.py: Fix problem induced by recent scap changes [puppet] - 10https://gerrit.wikimedia.org/r/767242 (owner: 10Ahmon Dancy) [21:21:14] damn whitespace [21:21:43] RECOVERY - Disk space on cp6013 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp6013&var-datasource=drmrs+prometheus/ops [21:21:47] (03PS2) 10Ahmon Dancy: check_mw_versions.py: Fix problem induced by recent scap changes [puppet] - 10https://gerrit.wikimedia.org/r/767242 [21:22:05] PROBLEM - Number of messages locally queued by purged for processing on cp6015 is CRITICAL: cluster=cache_text instance=cp6015 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6015 [21:22:11] PROBLEM - Ensure local MW versions match expected deployment on mw2265 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:24:51] RECOVERY - Number of messages locally queued by purged for processing on cp6015 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6015 [21:25:07] PROBLEM - Ensure local MW versions match expected deployment on parse2019 is CRITICAL: CRITICAL: 128 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:33:38] jouncebot now [21:33:38] For the next 0 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220301T2100) [21:35:45] I'm going to try sync-world again to see if it clears those lingering alertsd [21:36:02] !log dancy@deploy1002 Started scap: Resync to try to clear alerts [21:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:21] PROBLEM - Number of messages locally queued by purged for processing on cp6013 is CRITICAL: cluster=cache_text instance=cp6013 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [21:37:24] dancy: in theory scap sync-wikiversions should be enough to clear 'em [21:39:45] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:41:05] !log dancy@deploy1002 Started scap: Resync to try to clear alerts [21:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:51] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:47:39] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 2, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:47:48] so first eqsin and then ulsfo [21:47:49] checking [21:48:29] sukhe: nothing on the scheduled maintenance calendar [21:48:37] but it's very likely to be a ulsfo<>eqsin transport link [21:49:11] yeah, the transport link from SingTel [21:49:13] yeah [21:49:38] I'll contact them [21:49:47] ok thanks, I was reading Wikitech on router interface down [21:49:51] seems like not a cause for concern though [21:50:04] not immediate, no [21:50:47] thanks [21:52:07] (03CR) 10Bking: elastic: prevent rundir from deletion (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [21:53:14] !log dancy@deploy1002 Finished scap: Resync to try to clear alerts (duration: 12m 08s) [21:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:29] RECOVERY - Ensure local MW versions match expected deployment on mw1412 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:53:31] RECOVERY - Ensure local MW versions match expected deployment on mw2290 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:53:49] RECOVERY - Ensure local MW versions match expected deployment on mw2288 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:53:59] RECOVERY - Ensure local MW versions match expected deployment on mw1339 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:53:59] RECOVERY - Ensure local MW versions match expected deployment on labweb1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:53:59] RECOVERY - Ensure local MW versions match expected deployment on mw1352 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:53:59] RECOVERY - Ensure local MW 
versions match expected deployment on labweb1001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:53:59] RECOVERY - Ensure local MW versions match expected deployment on mw1335 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:53:59] RECOVERY - Ensure local MW versions match expected deployment on mw1340 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:54:00] RECOVERY - Ensure local MW versions match expected deployment on mw1355 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:54:05] sigh [21:55:35] RECOVERY - Number of messages locally queued by purged for processing on cp6013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [21:55:49] RECOVERY - Ensure local MW versions match expected deployment on mw2265 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:58:51] RECOVERY - Ensure local MW versions match expected deployment on parse2019 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:58:53] RECOVERY - Ensure local MW versions match expected deployment on mw1455 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:58:53] RECOVERY - Ensure local MW versions match expected deployment on mw1440 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:58:53] RECOVERY - Ensure local MW versions match expected deployment on mw2314 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:58:53] RECOVERY - Ensure local MW versions match expected deployment on mw2305 is OK: OKAY: wikiversions in sync 
https://wikitech.wikimedia.org/wiki/Application_servers [21:58:53] RECOVERY - Ensure local MW versions match expected deployment on mw2295 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:58:53] RECOVERY - Ensure local MW versions match expected deployment on mw2368 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:58:55] RECOVERY - Ensure local MW versions match expected deployment on mw2363 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:59:05] RECOVERY - Ensure local MW versions match expected deployment on mwdebug2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:59:09] RECOVERY - Ensure local MW versions match expected deployment on mw1413 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:59:11] RECOVERY - Ensure local MW versions match expected deployment on mw1400 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:59:15] icinga-wm: welcome back [21:59:35] RECOVERY - Ensure local MW versions match expected deployment on mw1390 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [21:59:37] RECOVERY - Ensure local MW versions match expected deployment on mw2293 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [22:04:57] PROBLEM - Confd vcl based reload on cp6009 is CRITICAL: reload-vcl failed to run since 0h, 56 minutes. https://wikitech.wikimedia.org/wiki/Varnish [22:09:37] 10SRE, 10SRE Observability: SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10RLazarus) [22:12:41] ^ we have spent some time on the cp600* hosts to try to figure out the problem. 
we are not fully convinced with the solution so will need to debug it further [22:12:57] if the alerts come again, I will just downtime the cp hosts [22:22:37] PROBLEM - SSH on cp6016 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:32:37] !log T276198 disabling puppet on elastic1052.eqiad.wmnet to test failure condition (rebooting shortly) [22:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:40] T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 [22:33:37] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp6016.drmrs.wmnet with reason: debugging till we find the root cause of the purged OOM issue; no traffic served [22:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:39] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp6016.drmrs.wmnet with reason: debugging till we find the root cause of the purged OOM issue; no traffic served [22:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:49] !log T276198 rebooting elastic1052.eqiad.wmnet to test failure condition [22:37:10] !log T276198 rebooting elastic1052.eqiad.wmnet to test failure condition [22:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:15] dancy: back now, should https://gerrit.wikimedia.org/r/c/operations/puppet/+/767242 still be done? 
[22:39:22] (PS3) Dzahn: check_mw_versions.py: Fix problem induced by recent scap changes [puppet] - https://gerrit.wikimedia.org/r/767242 (https://phabricator.wikimedia.org/T302832) (owner: Ahmon Dancy) [22:39:27] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect - Telia, AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:39:36] (Abandoned) Dzahn: mediawiki: disable version monitoring [puppet] - https://gerrit.wikimedia.org/r/767237 (https://phabricator.wikimedia.org/T302832) (owner: Dzahn) [22:40:33] PROBLEM - Host elastic1052 is DOWN: PING CRITICAL - Packet loss = 100% [22:40:45] RECOVERY - Host elastic1052 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:41:55] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [22:42:53] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:44:50] Getting outage reports from a number of sources [22:46:03] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 647 probes of 738 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:46:11] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:36] ew grafana is in light mode [22:47:47] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units 
failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:47:50] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [22:48:14] Oh good, it's not just me [22:48:27] o/ looking [22:48:45] hm I can't reach logstash either, encabulating [22:48:48] (PS1) Brennen Bearnes: WIP: gitlab: enable agent server for kubernetes [puppet] - https://gerrit.wikimedia.org/r/767249 (https://phabricator.wikimedia.org/T283894) [22:48:53] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:55] I think ulsfo is having trouble [22:49:02] i can get enwp fine from UK [22:49:36] cdanis: that tracks, I can reach bast2002 but not 4003 [22:49:37] RECOVERY - SSH on cp6016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:49:42] cdanis: depool ulsfo? [22:49:46] Definitely ulsfo related, bast4003 isn't reachable [22:49:49] rzl: ++ [22:49:49] ^ what rzl said [22:49:55] ack, online as well [22:50:33] i can curl ulsfo from here [22:50:35] yeah, users from the west coast are reporting errors on Discord, nothing elsewhere [22:50:49] I'm still setting up tunnelencabulator, jhathaway or cdanis can you start the depool if you haven't already? 
[22:50:59] !log T276198 reenabled puppet on elastic1052.eqiad.wmnet [22:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:03] T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 [22:51:14] inflatador: mid-outage, stop work for now please [22:51:26] rzl ACK, we have stopped [22:51:26] rzl: not something I have done before, i.e. I need to setup tunnelencabulator as well [22:51:36] so cdanis would definitely be faster on the depool [22:51:59] ping to ulsfo fails [22:52:01] (I put my mtr in _security) [22:52:06] jhathaway: is ulsfo your nearest DC? if you can reach gerrit etc you don't need tunnelencabulator :) but if you haven't depooled a DC before no worries [22:52:23] okay I'm set up, going ahead [22:52:33] rzl: sounds good [22:52:34] cdanis++ this tool is easier than I thought [22:52:50] jhathaway: do you mind taking IC and starting a doc etc? [22:52:58] rzl: sure [22:53:13] I will be ready to depool in about 2 more minutes, my laptop is on the fritz [22:53:22] but I am hoping bgp reconverges for most users meanwhile [22:53:23] (PS1) Legoktm: Depool ulsfo [dns] - https://gerrit.wikimedia.org/r/767250 [22:53:37] rzl: ^ [22:53:53] ack, on it [22:54:00] thanks lego [22:54:10] mtr,ping,curl @ https://phabricator.wikimedia.org/P21628 [22:54:30] (CR) RLazarus: [C: +2] Depool ulsfo [dns] - https://gerrit.wikimedia.org/r/767250 (owner: Legoktm) [22:55:15] authdns-update running [22:56:04] OK - authdns-update successful on all nodes! 
[22:56:08] now just awaiting DNS TTLs [22:56:45] legoktm: <3 [23:02:06] :D [23:02:08] I'm back now [23:02:19] me too :) thanks guys [23:02:33] confirmed mtr ends in codfw now and not ulsfo anymore [23:02:42] it worked for me the whole time though..oddly [23:08:09] TTLs should have mostly passed at this point and https://grafana-rw.wikimedia.org/d/000000180/varnish-http-requests looks mostly-but-not-quite recovered -- is anyone here still having trouble reaching wikis normally? [23:08:13] rate of incoming NELs now looks normal-ish. [23:09:05] Seddon: looking good on Discord? [23:09:09] PROBLEM - Number of messages locally queued by purged for processing on cp6009 is CRITICAL: cluster=cache_text instance=cp6009 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6009 [23:09:53] RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [23:11:36] cdanis: yep all good [23:11:51] RECOVERY - Number of messages locally queued by purged for processing on cp6009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6009 [23:12:04] Seddon: awesome thanks <3 [23:12:06] legoktm: <3 again [23:12:39] :))) [23:18:24] legoktm: seconded [23:20:08] rzl funnily enough, I just got done watching your incident response video about an hr ago and I must say: you definitely practice what you preach! 
[23:20:51] ahaha [23:23:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 102 probes of 654 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:26:34] nice job all, makes me feel quite privileged to work here [23:33:05] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 644 probes of 738 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:36:55] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 59 probes of 654 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:42:47] PROBLEM - Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 33313 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops [23:50:21] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: cloudcontrol1005, wdqs2003, cloudcontrol1003, wdqs1004, cloudcontrol1004, wdqs2002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [23:50:23] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: wdqs1004, wdqs2003, wdqs2002, cloudcontrol1004, cloudcontrol1003, cloudcontrol1005 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [23:50:23] PROBLEM - Ensure hosts are not performing a change on every 
puppet run on cumin2001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: wdqs2003, wdqs2002, cloudcontrol1003, cloudcontrol1004, cloudcontrol1005, wdqs1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [23:51:47] inflatador: feel free to go back to touching stuff in prod, if you haven't already :D sorry to be late saying so