[00:00:40] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:30] <wikibugs>	 (PS1) Dzahn: httpd: only load modules actually needed, further simplify config, add links [container/miscweb] - https://gerrit.wikimedia.org/r/698273 (https://phabricator.wikimedia.org/T281538)
[00:18:09] <mutante>	 !log backup1001 systemctl reload bacula-dir  fails
[00:18:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:41] <mutante>	 !log backup1001 - systemctl bacula-dir works again (restoring backup for non-existing host)
[00:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:21] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (Papaul)
[00:30:28] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install pc2011-pc2014 - https://phabricator.wikimedia.org/T282482 (Papaul)
[00:33:08] <wikibugs>	 SRE, ops-codfw, Data-Persistence (Consultation), serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Papaul)
[00:52:08] <wikibugs>	 (CR) Dzahn: [C: +2] httpd: only load modules actually needed, further simplify config, add links [container/miscweb] - https://gerrit.wikimedia.org/r/698273 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[00:53:33] <wikibugs>	 (Merged) jenkins-bot: httpd: only load modules actually needed, further simplify config, add links [container/miscweb] - https://gerrit.wikimedia.org/r/698273 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[01:03:34] <wikibugs>	 (PS2) Dzahn: static-bugzilla: add config to serve compressed HTML [container/miscweb] - https://gerrit.wikimedia.org/r/698070
[01:03:35] <wikibugs>	 (PS2) Dzahn: static-bugzilla: add gzipped test file [container/miscweb] - https://gerrit.wikimedia.org/r/698079 (https://phabricator.wikimedia.org/T281538)
[01:06:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:08:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:29:08] <icinga-wm>	 PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventstreams_internal_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:41:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210605T0700)
[08:21:42] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 70 probes of 626 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:27:26] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 39 probes of 626 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:33:12] <wikibugs>	 SRE, netops: routinator: create garbage collection job - https://phabricator.wikimedia.org/T282469 (ayounsi) a:ayounsi It's out: https://github.com/NLnetLabs/routinator/releases/tag/0.9.0 I'll give it a few days in case there is a bugfix release, then I'll look at upgrading it.
[09:33:17] <wikibugs>	 (PS2) Giuseppe Lavagetto: mediawiki: fix etcd connection [deployment-charts] - https://gerrit.wikimedia.org/r/698228
[09:33:19] <wikibugs>	 (PS2) Giuseppe Lavagetto: mwdebug: add etcd servers, datacenter [deployment-charts] - https://gerrit.wikimedia.org/r/698229
[09:39:03] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: fix etcd connection [deployment-charts] - https://gerrit.wikimedia.org/r/698228 (owner: Giuseppe Lavagetto)
[09:41:21] <wikibugs>	 (Merged) jenkins-bot: mediawiki: fix etcd connection [deployment-charts] - https://gerrit.wikimedia.org/r/698228 (owner: Giuseppe Lavagetto)
[09:44:00] <wikibugs>	 (CR) Giuseppe Lavagetto: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/698229 (owner: Giuseppe Lavagetto)
[13:01:33] <wikibugs>	 (CR) Hashar: [C: +2] "Let's go! And next week we can do some deployments :]" [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684411 (owner: Hashar)
[13:07:37] <wikibugs>	 (Merged) jenkins-bot: [WMF] script to build our plugins [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684411 (owner: Hashar)
[14:09:40] <wikibugs>	 SRE, LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (BVershbow_WMF) Thanks for the quick attention to this! :)
[14:28:28] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mwdebug: add etcd servers, datacenter [deployment-charts] - https://gerrit.wikimedia.org/r/698229 (owner: Giuseppe Lavagetto)
[14:30:48] <wikibugs>	 (Merged) jenkins-bot: mwdebug: add etcd servers, datacenter [deployment-charts] - https://gerrit.wikimedia.org/r/698229 (owner: Giuseppe Lavagetto)
[14:35:38] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[14:35:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:22] <wikibugs>	 SRE, MW-on-K8s, serviceops: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe)
[14:51:55] <wikibugs>	 SRE, MW-on-K8s, serviceops: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe) After solving various problems with the deployment, the situation now is: ` curl -H 'Host: en.wikipedia.org' http://10.64.75.196:8080/wiki/Main_Page <br /> <b>Fatal er...
[15:21:22] <Amir1>	 !log delete mbox files of group D and E in mm2 (T282303)
[15:21:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:27] <stashbot>	 T282303: The Great Clean Up of Mailman2  - https://phabricator.wikimedia.org/T282303
[15:45:32] <icinga-wm>	 PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=misc file=mailman_queues.prom instance=lists1001 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[16:16:11] <Amir1>	 !log deleting all private archives of mm2. All are inaccessible now (T282303)
[16:16:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:19] <stashbot>	 T282303: The Great Clean Up of Mailman2  - https://phabricator.wikimedia.org/T282303
[16:31:00] <wikibugs>	 (PS4) Ladsgroup: mailman: Drop absented files and packages [puppet] - https://gerrit.wikimedia.org/r/697635 (https://phabricator.wikimedia.org/T282303)
[16:31:02] <wikibugs>	 (PS4) Ladsgroup: backup: Drop mm2 exclude backups [puppet] - https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303)
[16:31:27] <wikibugs>	 (CR) Ladsgroup: backup: Drop mm2 exclude backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[16:41:10] <wikibugs>	 (PS1) Ladsgroup: mailman: Drop lists3 role [puppet] - https://gerrit.wikimedia.org/r/698306 (https://phabricator.wikimedia.org/T282303)
[16:45:03] <wikibugs>	 (PS1) Ladsgroup: prometheus: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/698307 (https://phabricator.wikimedia.org/T273673)
[16:48:25] <wikibugs>	 (PS1) Ladsgroup: rsync: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/698308 (https://phabricator.wikimedia.org/T273673)
[16:50:32] <wikibugs>	 (PS1) Ladsgroup: dumps: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/698309 (https://phabricator.wikimedia.org/T273673)
[18:13:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:15:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:38:18] <icinga-wm>	 PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:14:50] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:38:52] <icinga-wm>	 RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:09:15] <ma>	 legoktm: are x-spam-score headers not present in mailman3?
[20:14:26] <icinga-wm>	 PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:15:58] <icinga-wm>	 RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook