[00:00:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:13:30] (PS1) Dzahn: httpd: only load modules actually needed, further simplify config, add links [container/miscweb] - https://gerrit.wikimedia.org/r/698273 (https://phabricator.wikimedia.org/T281538)
[00:18:09] !log backup1001 systemctl reload bacula-dir fails
[00:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:41] !log backup1001 - systemctl baclua-dir works again (restoring backup for non-existing host)
[00:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:21] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (Papaul)
[00:30:28] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install pc2011-pc2014 - https://phabricator.wikimedia.org/T282482 (Papaul)
[00:33:08] SRE, ops-codfw, Data-Persistence (Consultation), serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Papaul)
[00:52:08] (CR) Dzahn: [C: +2] httpd: only load modules actually needed, further simplify config, add links [container/miscweb] - https://gerrit.wikimedia.org/r/698273 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[00:53:33] (Merged) jenkins-bot: httpd: only load modules actually needed, further simplify config, add links [container/miscweb] - https://gerrit.wikimedia.org/r/698273 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[01:03:34] (PS2) Dzahn: static-bugzilla: add config to serve compressed HTML [container/miscweb] - https://gerrit.wikimedia.org/r/698070
[01:03:35] (PS2) Dzahn: static-bugzilla: add gzipped test file [container/miscweb] - https://gerrit.wikimedia.org/r/698079 (https://phabricator.wikimedia.org/T281538)
[01:06:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:08:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:29:08] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventstreams_internal_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:41:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210605T0700)
[08:21:42] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 70 probes of 626 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:27:26] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 39 probes of 626 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:33:12] SRE, netops: routinator: create gabage collection job - https://phabricator.wikimedia.org/T282469 (ayounsi) a:ayounsi It's out https://github.com/NLnetLabs/routinator/releases/tag/0.9.0 I'll give it a few days in case there is a bugfix release then I'll look at upgrading it.
[09:33:17] (PS2) Giuseppe Lavagetto: mediawiki: fix etcd connection [deployment-charts] - https://gerrit.wikimedia.org/r/698228
[09:33:19] (PS2) Giuseppe Lavagetto: mwdebug: add etcd servers, datacenter [deployment-charts] - https://gerrit.wikimedia.org/r/698229
[09:39:03] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: fix etcd connection [deployment-charts] - https://gerrit.wikimedia.org/r/698228 (owner: Giuseppe Lavagetto)
[09:41:21] (Merged) jenkins-bot: mediawiki: fix etcd connection [deployment-charts] - https://gerrit.wikimedia.org/r/698228 (owner: Giuseppe Lavagetto)
[09:44:00] (CR) Giuseppe Lavagetto: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/698229 (owner: Giuseppe Lavagetto)
[13:01:33] (CR) Hashar: [C: +2] "Lets go! And next week we can do some deployments :]" [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684411 (owner: Hashar)
[13:07:37] (Merged) jenkins-bot: [WMF] script to build our plugins [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684411 (owner: Hashar)
[14:09:40] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Ben Vershbow - https://phabricator.wikimedia.org/T284248 (BVershbow_WMF) Thanks for the quick attention to this! :)
[14:28:28] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: add etcd servers, datacenter [deployment-charts] - https://gerrit.wikimedia.org/r/698229 (owner: Giuseppe Lavagetto)
[14:30:48] (Merged) jenkins-bot: mwdebug: add etcd servers, datacenter [deployment-charts] - https://gerrit.wikimedia.org/r/698229 (owner: Giuseppe Lavagetto)
[14:35:38] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[14:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:22] SRE, MW-on-K8s, serviceops: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe)
[14:51:55] SRE, MW-on-K8s, serviceops: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe) After solving various problems with the deployment, the situation now is: ` curl -H 'Host: en.wikipedia.org' http://10.64.75.196:8080/wiki/Main_Page Fatal er...
[15:21:22] !log delete mbox files of group D and E in mm2 (T282303)
[15:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:27] T282303: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303
[15:45:32] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=misc file=mailman_queues.prom instance=lists1001 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[16:16:11] !log deleting all private archives of mm2. All are inaccessible now (T282303)
[16:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:19] T282303: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303
[16:31:00] (PS4) Ladsgroup: mailman: Drop absented files and packages [puppet] - https://gerrit.wikimedia.org/r/697635 (https://phabricator.wikimedia.org/T282303)
[16:31:02] (PS4) Ladsgroup: backup: Drop mm2 exclude backups [puppet] - https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303)
[16:31:27] (CR) Ladsgroup: backup: Drop mm2 exclude backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[16:41:10] (PS1) Ladsgroup: mailman: Drop lists3 role [puppet] - https://gerrit.wikimedia.org/r/698306 (https://phabricator.wikimedia.org/T282303)
[16:45:03] (PS1) Ladsgroup: prometheus: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/698307 (https://phabricator.wikimedia.org/T273673)
[16:48:25] (PS1) Ladsgroup: rsync: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/698308 (https://phabricator.wikimedia.org/T273673)
[16:50:32] (PS1) Ladsgroup: dumps: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/698309 (https://phabricator.wikimedia.org/T273673)
[18:13:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:15:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:38:18] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:14:50] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:38:52] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:09:15] legoktm: are x-spam-score headers not present in mailman3?
[20:14:26] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:15:58] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook