[00:05:10] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:06] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:20] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 49.83 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[00:32:54] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:50] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:10] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:03:04] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:13:50] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:29:10] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:48] PROBLEM - Check systemd state on prometheus5001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:44] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:28] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is CRITICAL: 115.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[02:34:58] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:38:48] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 13 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:59:10] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is CRITICAL: 133.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[03:27:52] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2044-production-search-psi-codfw on elastic2044 is OK: (C)100 gt (W)80 gt 70.17 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2044&panelId=37
[03:59:10] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2053-production-search-psi-codfw on elastic2053 is OK: (C)100 gt (W)80 gt 38.64 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&panelId=37
[04:54:26] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:56:20] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 13 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:18:58] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:20:48] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:59:24] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[06:01:16] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 12 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[06:14:08] (03CR) 10Elukey: "Thanks a lot! Left some questions in the patch, but it looks great." [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[06:38:23] (03CR) 10Ayounsi: [C: 03+1] Temporarily filter port 25 on mx2001 for reimage [homer/public] - 10https://gerrit.wikimedia.org/r/720277 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff)
[06:49:32] PROBLEM - Check systemd state on cp5004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:04:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis)
[07:10:36] (03CR) 10Majavah: [C: 04-1] Improve the Kerberos automatic renewal service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis)
[07:16:50] RECOVERY - Check systemd state on cp5004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:33] (03CR) 10Muehlenhoff: [C: 03+2] Update puppetised java.security file for Java 11.0.12 [puppet] - 10https://gerrit.wikimedia.org/r/719064 (owner: 10Muehlenhoff)
[07:33:48] (03CR) 10Elukey: "Really great work, I have some optional suggestions:" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis)
[07:37:56] (03CR) 10Volans: [C: 03+2] "Trivial typo fix, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/720371 (owner: 10Volans)
[07:40:46] (03Merged) 10jenkins-bot: sre.experimental.reimage: fix typo in variable [cookbooks] - 10https://gerrit.wikimedia.org/r/720371 (owner: 10Volans)
[07:56:04] (03CR) 10Elukey: "Another thing - let's run a pcc for multiple nodes (even non-analytics ones) to confirm the no-op" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis)
[07:59:28] (03PS4) 10Rishabhbhat: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 (https://phabricator.wikimedia.org/T289752)
[08:02:30] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-10-13 08:01:48 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/
[08:07:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nice job!" [alerts] - 10https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite)
[08:08:22] (03CR) 10Filippo Giunchedi: [C: 03+1] o11y: add rsyslog alerts [alerts] - 10https://gerrit.wikimedia.org/r/720063 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite)
[08:08:55] (03CR) 10Filippo Giunchedi: [C: 03+1] logging: clean up legacy logstash alerts [puppet] - 10https://gerrit.wikimedia.org/r/720093 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite)
[08:13:42] PROBLEM - Check systemd state on irc2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:49] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 5 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10Nikerabbit) a:03Nikerabbit
[08:15:22] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:16:25] !log bump +100G prometheus/ops codfw
[08:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:18] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:19:50] (03PS2) 10Ema: rsyslog: stop saving trafficserver logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/719052 (https://phabricator.wikimedia.org/T290305)
[08:21:15] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10fgiunchedi) FWIW I'm +1 and happy to assist/review/etc
[08:21:40] (03CR) 10Ema: [C: 03+2] rsyslog: stop saving trafficserver logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/719052 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema)
[08:23:17] (03PS1) 10Volans: prometheus-puppet-agent-stats: temporary bandaid [puppet] - 10https://gerrit.wikimedia.org/r/720666 (https://phabricator.wikimedia.org/T290726)
[08:23:26] godog: I've sent ^^^ to bandaid it (see irc2001 above)
[08:23:50] as john is out this week so I guess any progress on the alternative plan will be done later on
[08:25:03] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus-puppet-agent-stats: temporary bandaid [puppet] - 10https://gerrit.wikimedia.org/r/720666 (https://phabricator.wikimedia.org/T290726) (owner: 10Volans)
[08:25:06] volans: thank you! LGTM
[08:25:29] thank you!
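The recurring "Check systemd state" alerts above all follow one Icinga plugin pattern: ask systemd whether the system is degraded and, if so, name the failed units (here prometheus_puppet_agent_stats.service). A minimal sketch of that pattern in Python — stdlib only, not the actual plugin from operations/puppet:

```python
#!/usr/bin/env python3
"""Icinga-style check: CRITICAL when systemd is in a degraded state.
Sketch of the pattern behind the alerts above, not the WMF plugin."""
import subprocess
import sys


def main() -> int:
    # "systemctl is-system-running" prints e.g. "running" or "degraded"
    # and exits non-zero for anything other than "running".
    state = subprocess.run(
        ["systemctl", "is-system-running"],
        capture_output=True, text=True,
    ).stdout.strip()

    if state == "running":
        print("OK - running: The system is fully operational")
        return 0

    # Name the failed units, e.g. prometheus_puppet_agent_stats.service.
    failed = subprocess.run(
        ["systemctl", "--failed", "--no-legend", "--plain"],
        capture_output=True, text=True,
    ).stdout
    units = " ".join(l.split()[0] for l in failed.splitlines() if l.strip())
    print(f"CRITICAL - {state}: The following units failed: {units}")
    return 2  # Nagios/Icinga convention for CRITICAL


if __name__ == "__main__":
    sys.exit(main())
```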
[08:25:40] (03CR) 10Volans: [C: 03+2] prometheus-puppet-agent-stats: temporary bandaid [puppet] - 10https://gerrit.wikimedia.org/r/720666 (https://phabricator.wikimedia.org/T290726) (owner: 10Volans)
[08:26:43] (03PS1) 10DCausse: elasticsearch: Force creation of tmp files before restart [puppet] - 10https://gerrit.wikimedia.org/r/720667 (https://phabricator.wikimedia.org/T276198)
[08:26:59] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (10dcausse) a:03dcausse
[08:27:12] RECOVERY - Check systemd state on irc2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:38] yay, it fixed it
[08:28:19] although ofc now the summary has the sha from my new commit, but I tested it there before so I know it worked :D
[08:28:29] wonder why sometimes it loses it
[08:29:01] yeah me too, I have this feeling though it isn't a rabbit hole I want to look into
[08:29:25] I think the plan to ditch it and go with logs is good though
[08:29:54] 10SRE, 10SRE-Access-Requests, 10Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10akosiaris) a:03akosiaris
[08:30:38] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10ayounsi) a:03DMburugu
[08:32:14] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10cmooney) 05Open→03Resolved
[08:32:27] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10cmooney) Gonna close this one given the lack of feedback. If there are any issues with access please open another task.
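T276198 above ("/var/run/elasticsearch deleted by elasticsearch") and the matching "Force creation of tmp files before restart" patch deal with a daemon whose runtime directory vanishes before a restart. The usual shape of the fix is to recreate the directory with the right ownership before the service starts (the actual patch likely does this through systemd-tmpfiles in puppet); a hypothetical pre-start hook, for illustration only:

```python
#!/usr/bin/env python3
"""Hypothetical ExecStartPre-style hook: recreate a service's runtime
directory before the daemon starts. Illustrates the class of fix in
T276198; the real change goes through puppet/systemd, not this script."""
import os
import shutil

RUN_DIR = "/var/run/elasticsearch"  # path taken from the task title
OWNER = GROUP = "elasticsearch"     # assumption: daemon user/group


def ensure_runtime_dir(path: str = RUN_DIR) -> None:
    os.makedirs(path, mode=0o755, exist_ok=True)  # no-op if it survived
    shutil.chown(path, user=OWNER, group=GROUP)   # needs root


if __name__ == "__main__":
    ensure_runtime_dir()
```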
[08:32:59] yeah agree
[08:33:14] (03PS1) 10Muehlenhoff: DHCP: Switch mx2001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/720668
[08:34:26] (03PS1) 10Elukey: Enable mmkubernetes (build depends on libcurl and liblognorm) and build rsyslog-kubernetes [debs/rsyslog] (debian/buster-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/720669 (https://phabricator.wikimedia.org/T206633)
[08:34:31] (03PS1) 10Elukey: Suspend mmkubernetes on connection errors [debs/rsyslog] (debian/buster-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/720670 (https://phabricator.wikimedia.org/T289766)
[08:36:50] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) I pushed manually to `debian/buster-wikimedia-k8s` the base `8.1901.0-1` version and published two patches starting from https://gerrit.wikimedia.org/r/c/operations/debs/rsyslog/+/720669/1
[08:40:12] (03PS1) 10Filippo Giunchedi: prometheus: temp exclude 'rails' from job availability alert [puppet] - 10https://gerrit.wikimedia.org/r/720671 (https://phabricator.wikimedia.org/T289454)
[08:42:26] (03CR) 10JMeybohm: [C: 03+1] Enable mmkubernetes (build depends on libcurl and liblognorm) and build rsyslog-kubernetes [debs/rsyslog] (debian/buster-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/720669 (https://phabricator.wikimedia.org/T206633) (owner: 10Elukey)
[08:42:31] (03CR) 10JMeybohm: [C: 03+1] Suspend mmkubernetes on connection errors [debs/rsyslog] (debian/buster-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/720670 (https://phabricator.wikimedia.org/T289766) (owner: 10Elukey)
[08:48:26] (03PS4) 10Jelto: helmfile.d/admin make tiller components configurable per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305)
[08:49:54] (03PS2) 10Alexandros Kosiaris: Using full names instead of shorthands [software/benchmw] - 10https://gerrit.wikimedia.org/r/719103
[08:49:56] (03PS2) 10Alexandros Kosiaris: Fix title of load test [software/benchmw] - 10https://gerrit.wikimedia.org/r/719104
[08:49:59] (03PS3) 10Alexandros Kosiaris: Add the ability to generate comparisions of latency percentiles [software/benchmw] - 10https://gerrit.wikimedia.org/r/719105
[08:50:34] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] ""s c" stands for "smooth csplines". Added that too, will merge" [software/benchmw] - 10https://gerrit.wikimedia.org/r/719103 (owner: 10Alexandros Kosiaris)
[08:50:59] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fix title of load test [software/benchmw] - 10https://gerrit.wikimedia.org/r/719104 (owner: 10Alexandros Kosiaris)
[08:52:05] (03PS1) 10MMandere: cloud: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/720672 (https://phabricator.wikimedia.org/T282787)
[08:53:06] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "self merging. This works quite ok up to now for T280497 we can always improve later on." [software/benchmw] - 10https://gerrit.wikimedia.org/r/719105 (owner: 10Alexandros Kosiaris)
[08:54:42] (03PS4) 10Alexandros Kosiaris: Fix 'load' title, add 'rl_startup', add 'parse_light' [software/benchmw] - 10https://gerrit.wikimedia.org/r/719608 (https://phabricator.wikimedia.org/T280497) (owner: 10Krinkle)
[08:55:36] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10JMeybohm) I've added a couple of lines to https://wikitech.wikimedia.org/wiki/Rsyslog#Packaging I'm not sure what the source for `8.2008.0` is but if we want to keep it around we should potenti...
[08:56:00] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fix 'load' title, add 'rl_startup', add 'parse_light' (031 comment) [software/benchmw] - 10https://gerrit.wikimedia.org/r/719608 (https://phabricator.wikimedia.org/T280497) (owner: 10Krinkle)
[08:57:35] (03PS2) 10Alexandros Kosiaris: Update bench urls and improve url labels [software/benchmw] - 10https://gerrit.wikimedia.org/r/720061 (owner: 10Krinkle)
[09:00:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Given this, the current data we have generated and use for comparisons might not be as good as we thought. I 'll checkout this change and " [software/benchmw] - 10https://gerrit.wikimedia.org/r/720061 (owner: 10Krinkle)
[09:00:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10cmooney) @odimitrijevic @Ottomata are you happy for this request to be approved?
[09:00:11] (03CR) 10Muehlenhoff: [C: 03+2] DHCP: Switch mx2001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/720668 (owner: 10Muehlenhoff)
[09:03:58] (03CR) 10Jelto: helmfile.d/admin make tiller components configurable per environment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[09:05:30] (03PS5) 10Jelto: helmfile.d/admin make tiller components configurable per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305)
[09:05:37] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable mmkubernetes (build depends on libcurl and liblognorm) and build rsyslog-kubernetes [debs/rsyslog] (debian/buster-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/720669 (https://phabricator.wikimedia.org/T206633) (owner: 10Elukey)
[09:06:09] (03CR) 10Filippo Giunchedi: [C: 03+1] Suspend mmkubernetes on connection errors [debs/rsyslog] (debian/buster-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/720670 (https://phabricator.wikimedia.org/T289766) (owner: 10Elukey)
[09:07:22] (03CR) 10Elukey: [V: 03+2 C: 03+2] Enable mmkubernetes (build depends on libcurl and liblognorm) and build rsyslog-kubernetes [debs/rsyslog] (debian/buster-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/720669 (https://phabricator.wikimedia.org/T206633) (owner: 10Elukey)
[09:07:50] (03CR) 10Elukey: [C: 03+2] Suspend mmkubernetes on connection errors [debs/rsyslog] (debian/buster-wikimedia-k8s) - 10https://gerrit.wikimedia.org/r/720670 (https://phabricator.wikimedia.org/T289766) (owner: 10Elukey)
[09:08:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Disable the "long running screen/tmux session" check by default [puppet] - 10https://gerrit.wikimedia.org/r/712123 (https://phabricator.wikimedia.org/T288028) (owner: 10Muehlenhoff)
[09:10:54] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) ` root@apt1001:/srv/wikimedia# reprepro lsbycomponent rsyslog rsyslog | 8.1901.0-1~bpo9+wmf2 | stretch-wikimedia | main | amd64, source rsyslog | 8.2008.0-1~bpo10+1 |...
[09:11:50] !log upload rsyslog* 8.1901.0-1+wmf2 to buster-wikimedia component/rsyslog-k8s - T277739
[09:11:51] !log reimaging sretest1002
[09:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:55] T277739: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739
[09:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/720672 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere)
[09:13:09] (03PS10) 10Giuseppe Lavagetto: safe-service-restart: only verify pooled services [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100)
[09:14:55] (03PS1) 10JMeybohm: rsyslog/kubernetes: Enable resume/retry for mmkubernetes [puppet] - 10https://gerrit.wikimedia.org/r/720677 (https://phabricator.wikimedia.org/T289766)
[09:16:02] (03PS1) 10Volans: sre.experimental.reimage: fix depool value [cookbooks] - 10https://gerrit.wikimedia.org/r/720678
[09:16:13] !log swift eqiad-prod: add weight to ms-be10[64-67] - T290546
[09:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:19] T290546: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546
[09:18:07] !log upgrade rsyslog* on ml-serve* nodes to 8.1901.0-1+wmf2
[09:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:16] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31054/console" [puppet] - 10https://gerrit.wikimedia.org/r/720677 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm)
[09:19:58] (03CR) 10Volans: [C: 03+2] "Trivial, self-merging to test it" [cookbooks] - 10https://gerrit.wikimedia.org/r/720678 (owner: 10Volans)
[09:20:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline too" [puppet] - 10https://gerrit.wikimedia.org/r/720677 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm)
[09:21:12] (03PS2) 10JMeybohm: rsyslog/kubernetes: Enable resume/retry for mmkubernetes [puppet] - 10https://gerrit.wikimedia.org/r/720677 (https://phabricator.wikimedia.org/T289766)
[09:21:33] (03CR) 10JMeybohm: rsyslog/kubernetes: Enable resume/retry for mmkubernetes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/720677 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm)
[09:21:43] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) 05Open→03Resolved a:03elukey
[09:22:34] (03CR) 10Elukey: [C: 03+1] rsyslog/kubernetes: Enable resume/retry for mmkubernetes [puppet] - 10https://gerrit.wikimedia.org/r/720677 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm)
[09:22:54] (03CR) 10JMeybohm: [C: 03+2] rsyslog/kubernetes: Enable resume/retry for mmkubernetes [puppet] - 10https://gerrit.wikimedia.org/r/720677 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm)
[09:23:42] (03Merged) 10jenkins-bot: sre.experimental.reimage: fix depool value [cookbooks] - 10https://gerrit.wikimedia.org/r/720678 (owner: 10Volans)
[09:24:03] (03CR) 10Jelto: [C: 03+1] "lgtm as a temporary solution, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/720671 (https://phabricator.wikimedia.org/T289454) (owner: 10Filippo Giunchedi)
[09:25:37] (03PS1) 10Muehlenhoff: Enable ganeti216 component for ganeti-test [puppet] - 10https://gerrit.wikimedia.org/r/720682
[09:26:04] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: temp exclude 'rails' from job availability alert [puppet] - 10https://gerrit.wikimedia.org/r/720671 (https://phabricator.wikimedia.org/T289454) (owner: 10Filippo Giunchedi)
[09:27:43] (03CR) 10Elukey: [C: 03+1] "LGTM, let's see what Janis thinks about it. Thanks a lot for this work!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[09:30:52] godog: tcpircbot doesn't seem to relay messages here, the unit is running, got restarted at 4am UTC this morning
[09:31:01] anything you might want to check before I bounce it?
[09:31:22] PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:31:46] volans: yeah I'll take a very quick look, thanks for the heads up
[09:32:01] (03CR) 10Hashar: [C: 04-1] profile::ci::slave::labs::common: move to cinder-based storage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: 10Andrew Bogott)
[09:32:04] it seemed to have reconnected to libera just fine
[09:32:13] but you can see in the logs my few log lines
[09:32:17] that didn't end up here
[09:32:42] and indeed the bot is not in this chan
[09:32:57] (03CR) 10Muehlenhoff: [C: 03+2] Enable ganeti216 component for ganeti-test [puppet] - 10https://gerrit.wikimedia.org/r/720682 (owner: 10Muehlenhoff)
[09:33:12] (03CR) 10Jelto: [C: 03+1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[09:33:15] yeah :( feel free to bounce it, I don't think we can usefully inspect it further
[09:33:18] RECOVERY - Host sretest1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[09:33:24] ack
[09:33:44] !log restarting tcpircbot-logmsgbot on alert1001, not relaying messages
[09:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:10] hello logmsgbot
[09:34:17] nor probably want to tbh, it has tech debt written all over it :|
[09:34:26] yeah I recall :/
[09:35:23] I wonder if I threaten to rewrite it in a few lines of golang, then it'll start working flawlessly again
[09:35:29] after all it is reading these lines
[09:36:38] ahahah, nice try!
[09:38:23] hehhe we'll see, but seriously though it shouldn't be too hard indeed to rewrite in golang IIRC
[09:38:54] hackathon project ;)
[09:41:02] yeah good idea! why not
[09:42:03] (03CR) 10JMeybohm: [C: 03+1] helmfile.d/admin make tiller components configurable per environment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[09:44:29] (03PS6) 10Jelto: helmfile.d/admin make tiller components configurable per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305)
[09:45:16] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T290764 (10cmooney) Hi Jess, Can you confirm your login to wikitech.wikimedia.org works? I am having no luck searching for your account on that on the system here (it is however my first day doing access req...
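For context on the tcpircbot exchange above: the bot's whole job is to accept lines on a local TCP socket (the !log pipeline) and relay them to an IRC channel, which is why "a few lines of golang" keeps coming up as a plausible rewrite. A stripped-down sketch of that relay loop — stdlib Python, placeholder nick/channel/port, and none of the ACLs, TLS, PING handling or reconnect logic the real bot needs:

```python
#!/usr/bin/env python3
"""Toy TCP->IRC relay in the spirit of tcpircbot: lines written to a
local TCP port are echoed to an IRC channel. Placeholders throughout."""
import socket

IRC_HOST, IRC_PORT = "irc.libera.chat", 6667  # plain-text port; real bot uses TLS
NICK, CHANNEL = "logmsgbot-toy", "#example"   # placeholders
LISTEN_ADDR = ("127.0.0.1", 9999)             # placeholder local port


def irc_connect() -> socket.socket:
    irc = socket.create_connection((IRC_HOST, IRC_PORT))
    irc.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :{NICK}\r\n".encode())
    irc.sendall(f"JOIN {CHANNEL}\r\n".encode())
    return irc


def main() -> None:
    irc = irc_connect()
    server = socket.create_server(LISTEN_ADDR)
    # NB: a real client must also read the server stream and answer
    # PINGs, or it eventually gets disconnected -- omitted for brevity.
    while True:
        conn, _ = server.accept()
        with conn, conn.makefile("r", encoding="utf-8", errors="replace") as lines:
            for line in lines:
                if line.strip():
                    irc.sendall(f"PRIVMSG {CHANNEL} :{line.strip()}\r\n".encode())


if __name__ == "__main__":
    main()
```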
[09:46:55] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] echoserver: Add echoserver debug container [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/717412 (owner: 10JMeybohm)
[09:47:34] (03PS1) 10DCausse: search-platform: Alert when blazegraph burns allocator too rapidly [alerts] - 10https://gerrit.wikimedia.org/r/720684 (https://phabricator.wikimedia.org/T284446)
[09:48:56] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T290764 (10Aklapper) @cmooney: https://wikitech.wikimedia.org/wiki/Special:Log?page=User:JKlein exists, not sure how you are searching?
[09:49:51] (03CR) 10jerkins-bot: [V: 04-1] search-platform: Alert when blazegraph burns allocator too rapidly [alerts] - 10https://gerrit.wikimedia.org/r/720684 (https://phabricator.wikimedia.org/T284446) (owner: 10DCausse)
[09:51:41] (03CR) 10Jelto: helmfile.d/admin make tiller components configurable per environment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[09:54:34] (03PS3) 10DCausse: search-platform: add flink alerts [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467)
[09:54:36] (03PS2) 10DCausse: search-platform: Alert when blazegraph burns allocator too rapidly [alerts] - 10https://gerrit.wikimedia.org/r/720684 (https://phabricator.wikimedia.org/T284446)
[09:58:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/720667 (https://phabricator.wikimedia.org/T276198) (owner: 10DCausse)
[10:00:56] (03PS1) 10Alexandros Kosiaris: Add comment for usage of underscores/spaces [software/benchmw] - 10https://gerrit.wikimedia.org/r/720685
[10:02:22] (03CR) 10jerkins-bot: [V: 04-1] search-platform: Alert when blazegraph burns allocator too rapidly [alerts] - 10https://gerrit.wikimedia.org/r/720684 (https://phabricator.wikimedia.org/T284446) (owner: 10DCausse)
[10:05:14] 10SRE, 10serviceops: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 (10JMeybohm) Failed to build an image today: ` 2021-09-13 09:51:42,980 [docker-pkg-build] ERROR - Build failed: devmapper: Thin Pool has 149221 free data blocks which is less than minimum required 163840 free d...
[10:06:29] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T290764 (10cmooney) @Aklapper thanks for the confirmation. I'd been searching the wrong way (with ldapsearch), but I can see where I went wrong, it comes up if I search with cn. @iamjessklein leave it with...
[10:13:15] (03CR) 10Giuseppe Lavagetto: "I would think "helm_version" would be a better label than "tillerEnabled", because it would be useful for other similar feature flags in t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[10:15:28] !log volans@cumin1001 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=93) for host mw1414.eqiad.wmnet
[10:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:45] (03PS1) 10Cathal Mooney: admin: Add jjbk (Jess Klien) to the list of ldap-only-users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/720710 (https://phabricator.wikimedia.org/T290764)
[10:19:37] (03CR) 10Giuseppe Lavagetto: "Given the +1 from alex and the PCC results, I will merge this and verify it works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto)
[10:19:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] safe-service-restart: only verify pooled services [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto)
[10:19:59] (03CR) 10Kosta Harlan: WikimediaEvents: Remove UnderstandingFirstDay config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713553 (owner: 10Kosta Harlan)
[10:20:05] (03PS2) 10Kosta Harlan: WikimediaEvents: Remove UnderstandingFirstDay config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713553
[10:20:25] (03PS3) 10Kosta Harlan: WikimediaEvents: Remove UnderstandingFirstDay config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713553
[10:24:15] (03CR) 10JMeybohm: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[10:25:32] (03PS1) 10Volans: sre.experimental.reimage: bugfixes [cookbooks] - 10https://gerrit.wikimedia.org/r/720712
[10:26:38] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:28:52] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210913T1030).
[10:30:22] (03PS7) 10Jelto: helmfile.d/admin make tiller components configurable per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305)
[10:35:16] (03CR) 10Jelto: helmfile.d/admin make tiller components configurable per environment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[10:36:31] (03PS1) 10Muehlenhoff: Change ganeti216 component to only add the apt source [puppet] - 10https://gerrit.wikimedia.org/r/720716
[10:38:30] (03CR) 10Volans: [C: 03+2] "Another round of trivial bug fixes from last run." [cookbooks] - 10https://gerrit.wikimedia.org/r/720712 (owner: 10Volans)
[10:39:55] (03CR) 10Ayounsi: [C: 03+1] admin: Add jjbk (Jess Klien) to the list of ldap-only-users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/720710 (https://phabricator.wikimedia.org/T290764) (owner: 10Cathal Mooney)
[10:41:39] (03Merged) 10jenkins-bot: sre.experimental.reimage: bugfixes [cookbooks] - 10https://gerrit.wikimedia.org/r/720712 (owner: 10Volans)
[10:43:46] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2002.codfw.wmnet
[10:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:12] (03CR) 10JMeybohm: [C: 04-1] kubernetes: add revscoring-editquality in the services configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[10:45:17] (03CR) 10Muehlenhoff: [C: 03+2] Change ganeti216 component to only add the apt source [puppet] - 10https://gerrit.wikimedia.org/r/720716 (owner: 10Muehlenhoff)
[10:47:06] (03CR) 10Btullis: Improve the Kerberos automatic renewal service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis)
[10:47:19] /quote cs op #wikimedia-operations effie
[10:47:22] (03CR) 10Cathal Mooney: [C: 03+2] admin: Add jjbk (Jess Klien) to the list of ldap-only-users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/720710 (https://phabricator.wikimedia.org/T290764) (owner: 10Cathal Mooney)
[10:47:24] oups
[10:47:27] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix name of envoy service [puppet] - 10https://gerrit.wikimedia.org/r/720718
[10:51:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2002.codfw.wmnet
[10:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:32] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2002.codfw.wmnet` - testvm2002.codfw.wmnet (**PASS**) - Downtimed host on Icing...
[10:51:47] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T290764 (10cmooney) Ok @iamjessklein I think I've managed to process your request successfully. Can you test the access and let me know if it looks ok? thanks!
[10:52:03] (03CR) 10Vgutierrez: [C: 03+2] embrace latest pylint recommendations [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717227 (owner: 10Vgutierrez)
[10:52:13] (03CR) 10Vgutierrez: [C: 03+2] acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] - 10https://gerrit.wikimedia.org/r/717167 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[10:55:42] (03PS1) 10Vgutierrez: Release 0.30 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/720721 (https://phabricator.wikimedia.org/T290249)
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210913T1100). Please do the needful.
[11:00:05] kostajh: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:11] o/
[11:00:16] \o
[11:00:18] I can deploy today!
[11:00:24] a config change with no message :o
[11:00:24] (unless kostajh wants to self-service)
[11:00:42] oh, oops
[11:00:54] I can deploy it
[11:01:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "no codesearch results for UnderstandingFirstDay outside of this repo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713553 (owner: 10Kosta Harlan)
[11:01:09] go ahead then 🙂
[11:01:13] ok :)
[11:01:29] (03CR) 10Vgutierrez: [C: 03+2] Release 0.30 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/720721 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[11:01:37] (03PS2) 10Lucas Werkmeister (WMDE): Don’t check constraints on two property qualifiers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583407 (https://phabricator.wikimedia.org/T235292)
[11:02:17] (03CR) 10Lucas Werkmeister (WMDE): Don’t check constraints on two property qualifiers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583407 (https://phabricator.wikimedia.org/T235292) (owner: 10Lucas Werkmeister (WMDE))
[11:03:09] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31055/console" [puppet] - 10https://gerrit.wikimedia.org/r/720718 (owner: 10Giuseppe Lavagetto)
[11:03:49] side note, do we have a tool that takes a URL like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/713553 and generates a message like `* [config] {{gerrit|713553}} WikimediaEvents: Remove UnderstandingFirstDay config` ?
[11:03:59] I'm not aware of any
[11:04:03] * urbanecm always does it manually
[11:04:07] same
[11:04:18] i'm that lazy :D
[11:04:23] and as you can see, I make mistakes
[11:04:33] who isn't?
[11:04:34] (which is also why I filed T219809 at some point)
[11:04:34] T219809: Triple-clicking Gerrit change subject selects unwanted space at the beginning - https://phabricator.wikimedia.org/T219809
[11:04:35] (in fact, i also type out the deployment commands manually, even though we have a tool for that)
[11:04:37] maybe it could be added to deploy-commands
[11:04:38] who doesn't ?
[11:05:09] (03Merged) 10jenkins-bot: Release 0.30 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/720721 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[11:05:16] :D
[11:06:19] urbanecm: I'm supposed to use deployment.codfw.wmnet, right?
[11:06:28] kostajh: deploy1002.eqiad.wmnet
[11:06:32] deployment host was never switched to codfw
[11:06:42] k
[11:10:24] I guess this is obvious to deployers who have done this a while, but is it documented somewhere that step one of deploying a config patch is to +2 it with a message of "Backport / Config" (or some other message)?
[11:11:02] I guess it’s more or less implied in https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Merging_patches
[11:11:07] but not exactly stated
[11:11:42] yeah I was just about to link to that
[11:12:04] two paragraphs above it says “You will be merging patches…” but there’s no direct instruction “here, now, merge the patch”
[11:12:06] that's probably more feedback for the deploy-commands tool, to say that you should first +2
[11:12:48] as an infrequent deployer, it's not totally intuitive because in local development you're usually doing e.g. `git review -d` if you want to verify a patch
[11:13:07] (03CR) 10Kosta Harlan: [C: 03+2] "Backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713553 (owner: 10Kosta Harlan)
[11:13:56] on the other hand, it is kinda similar to k8s deployments you do for linkrecommendation -- where you also need to +2 a patch in a repo.
[11:14:02] (03PS1) 10Vgutierrez: embrace latest pylint recommendations [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720732
[11:14:03] (it's a different kind of a patch, but still)
[11:14:04] (03PS1) 10Vgutierrez: acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720733 (https://phabricator.wikimedia.org/T290249)
[11:14:06] (03PS1) 10Vgutierrez: Release 0.30 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720734 (https://phabricator.wikimedia.org/T290249)
[11:14:08] (03PS1) 10Vgutierrez: debian: Add release 0.30 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720735 (https://phabricator.wikimedia.org/T290249)
[11:14:10] (03Merged) 10jenkins-bot: WikimediaEvents: Remove UnderstandingFirstDay config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713553 (owner: 10Kosta Harlan)
[11:14:54] urbanecm: hmm, yeah that makes sense.
[11:15:28] (03PS1) 10Muehlenhoff: Reset Hiera flag for test setup [puppet] - 10https://gerrit.wikimedia.org/r/720736
[11:15:55] urbanecm: for the mwdebug step in https://deploy-commands.toolforge.org/bacc/713553, I should use codfw?
[11:16:03] yes
[11:16:21] (eqiad hosts will also work, but they're RO currently)
[11:17:36] so, I did the `git fetch` command on deployment.eqiad, but now I'm on mwdebug2001.codfw, and `scap pull` says `11:16:43 Copying from deployment.codfw.wmnet to mwdebug2001.codfw.wmnet`
[11:17:55] wouldn't that mean that the updated InitialiseSettings.php is not on mwdebug2001.codfw?
[11:18:06] it does work
[11:18:14] (deployment.codfw.wmnet is actually a cname to the eqiad host)
[11:18:37] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:19:05] you can always verify by going to /srv/mediawiki/wmf-config at the mwdebug host :-)
[11:20:41] ok
[11:22:14] (03CR) 10Muehlenhoff: [C: 03+2] Reset Hiera flag for test setup [puppet] - 10https://gerrit.wikimedia.org/r/720736 (owner: 10Muehlenhoff)
[11:24:02] !log kharlan@deploy1002 Synchronized wmf-config: Config: [[gerrit:713553|WikimediaEvents: Remove UnderstandingFirstDay config]] (duration: 00m 59s)
[11:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:07] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:26:12] !log European mid-day backport window deploys done
[11:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
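The tool kostajh asks about above mostly needs a single Gerrit REST call: the change detail includes the commit subject, and the rest is string formatting. A sketch using the public API (the `)]}'` guard prefix on Gerrit's JSON responses is real; the `[config]` tag is hard-coded here on the assumption the change is in mediawiki-config):

```python
#!/usr/bin/env python3
"""Turn a Gerrit change number into a deployment-calendar line such as
'* [config] {{gerrit|713553}} WikimediaEvents: Remove UnderstandingFirstDay config'.
Sketch only; needs Python >= 3.9 for str.removeprefix."""
import json
import urllib.request

GERRIT = "https://gerrit.wikimedia.org/r"


def deploy_line(change: int) -> str:
    with urllib.request.urlopen(f"{GERRIT}/changes/{change}") as resp:
        body = resp.read().decode("utf-8")
    # Gerrit prefixes JSON responses with ")]}'" to defeat XSSI.
    data = json.loads(body.removeprefix(")]}'"))
    return f"* [config] {{{{gerrit|{change}}}}} {data['subject']}"


if __name__ == "__main__":
    print(deploy_line(713553))
```

Hooking something like this into deploy-commands, as suggested above, would remove the copy-paste step (and the stray-space problem from T219809).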
[11:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:33] (03CR) 10Btullis: Install Alluxio to the test cluster (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis)
[11:32:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:25] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[11:37:10] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[11:38:21] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:39:37] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:41:27] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:43:23] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:52:53] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[11:56:45] (03PS1) 10Marostegui: dbstore1007: Decrease buffer_pool_sizes [puppet] - 10https://gerrit.wikimedia.org/r/720739 (https://phabricator.wikimedia.org/T290841)
[11:58:06] (03CR) 10Marostegui: [C: 03+2] dbstore1007: Decrease buffer_pool_sizes [puppet] - 10https://gerrit.wikimedia.org/r/720739 (https://phabricator.wikimedia.org/T290841) (owner: 10Marostegui)
[11:59:34] moritzm: the uncommitted dns changes seem related to testvm2002, did something not work as expected in the decommission cookbook?
[12:02:25] not really sure, the output of the decom run was a little strange, it told me "Nothing to commit!"
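On the verification question a few lines up (is the new InitialiseSettings.php really live on mwdebug2001?): besides inspecting /srv/mediawiki/wmf-config on the host, a request can be pinned to a specific debug backend with the X-Wikimedia-Debug routing header documented on Wikitech. A sketch — header format per that documentation, everything else an assumption:

```python
#!/usr/bin/env python3
"""Route one request through a specific mwdebug backend via the
X-Wikimedia-Debug header (see wikitech.wikimedia.org/wiki/X-Wikimedia-Debug)."""
import urllib.request

req = urllib.request.Request(
    "https://meta.wikimedia.org/wiki/Special:Version",
    headers={"X-Wikimedia-Debug": "backend=mwdebug2001.codfw.wmnet"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, len(resp.read()))
```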
[12:02:30] on cumin2002
[12:02:41] doh
[12:02:47] weird
[12:03:00] let me run the netbox cookbook to sync that first
[12:03:04] ack
[12:03:32] !log volans@cumin1001 START - Cookbook sre.dns.netbox
[12:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:59] happy to re-run the decom script, but I'll be reinstalling testvm2002 as well, so that would also straighten it out at some point
[12:04:25] nah, has nothing to do with the decom cookbook, if it said nothing to commit it means that the script on netbox1001 for some reason didn't see the changes
[12:04:34] I'm checking the logs to see if we hit some race condition
[12:07:49] ack
[12:09:55] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:02] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki)
[12:22:13] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[12:26:47] (03CR) 10Vgutierrez: [C: 03+2] embrace latest pylint recommendations [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720732 (owner: 10Vgutierrez)
[12:26:50] (03CR) 10Vgutierrez: [C: 03+2] acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720733 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[12:26:54] (03CR) 10Vgutierrez: [C: 03+2] Release 0.30 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720734 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[12:27:07] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.30 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720735 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[12:30:39] (03Merged) 10jenkins-bot: embrace latest pylint recommendations [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720732 (owner: 10Vgutierrez)
[12:30:41] (03Merged) 10jenkins-bot: acme_chief,api: Provide .chained.crt.key.ocsp [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720733 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[12:30:43] (03Merged) 10jenkins-bot: Release 0.30 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720734 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[12:30:48] (03Merged) 10jenkins-bot: debian: Add release 0.30 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720735 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez)
[12:35:52] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 6 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10Nikerabbit) Translate patch should re...
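The "Uncommitted DNS changes in Netbox" check that fired and recovered above follows another common plugin shape: run a command that reports pending changes and treat non-empty output as CRITICAL. The real check and the command it wraps live in puppet; this generic sketch only shows the shape, and `netbox-dns-diff` is a made-up placeholder:

```python
#!/usr/bin/env python3
"""Generic 'pending changes' plugin: CRITICAL when the wrapped command
prints a non-empty diff. Placeholder command, not the WMF check."""
import subprocess
import sys


def main() -> int:
    result = subprocess.run(
        ["netbox-dns-diff"],  # placeholder: prints pending DNS changes
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"UNKNOWN - diff command failed: {result.stderr.strip()}")
        return 3
    if result.stdout.strip():
        print("CRITICAL - Netbox has uncommitted DNS changes")
        return 2
    print("OK - Netbox has zero uncommitted DNS changes")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```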
[12:39:11] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:41:54] 10SRE-swift-storage, 10User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10fgiunchedi) The other problem I noticed, though not specific to thanos but rather ferm + pontoon, is that `@resolve` calls will fail in WMCS: ` -- Boot 5c400f48d7c24abaaa78d50a591d8e8c -- Sep...
[12:44:06] (03PS1) 10DCausse: query service: Fix loading of DCAT-AP dataset [puppet] - 10https://gerrit.wikimedia.org/r/720746 (https://phabricator.wikimedia.org/T289517)
[12:51:02] (03CR) 10Jelto: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[12:58:11] 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (10Yann)...
[12:58:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10cmooney) @MRaishWMF Hi. We have noticed that there are two LDAP accounts in your name currently. Your original account was set up in Sept 2020 - us...
[12:58:44] (03PS3) 10DCausse: search-platform: Alert when blazegraph burns allocator too rapidly [alerts] - 10https://gerrit.wikimedia.org/r/720684 (https://phabricator.wikimedia.org/T284446)
[13:04:00] (03CR) 10Elukey: [C: 04-1] kubernetes: add revscoring-editquality in the services configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[13:08:15] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum2001.codfw.wmnet
[13:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] kube_env: Give usage when no arguments are passed [puppet] - 10https://gerrit.wikimedia.org/r/719562 (owner: 10Hnowlan)
[13:13:01] (03CR) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[13:13:25] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force [puppet] - 10https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038) (owner: 10Ahmon Dancy)
[13:16:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force [puppet] - 10https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038) (owner: 10Ahmon Dancy)
[13:16:33] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki: fix name of envoy service [puppet] - 10https://gerrit.wikimedia.org/r/720718 (owner: 10Giuseppe Lavagetto)
[13:17:30] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (10chmielkomaslak) I'm very sorry for the delay I was on PTO, I could try it now and seems it working, thank You so much!
[13:19:04] Daimona can you dump backtraces in the method that is failing up to the return call?
[13:19:21] Naturally they should all be the same.
[13:19:50] 10SRE, 10SRE-Access-Requests: Replace JAbrams' old ssh public key with a new one - https://phabricator.wikimedia.org/T290433 (10JAbrams) Hi all, Many thanks for your help. My access and password have been set up :). Cheers, Janina
[13:20:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum2001.codfw.wmnet
[13:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:25] I'm trying to get one
[13:21:20] https://www.php.net/manual/en/function.debug-print-backtrace.php would probably be the easiest way to dump them.
[13:21:49] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum2002.codfw.wmnet
[13:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:08] That's just a PHP backtrace, not very useful here I'd say... But I'll post it on phab
[13:23:01] Is there a way to get a trace on PHP itself? I'd be interested in knowing that? :O
[13:23:46] But either way, I would venture a guess that even a PHP backtrace could be revealing as it would also need to rely on return addresses to even generate it I assume.
[13:25:53] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[13:32:15] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[13:32:55] 10SRE, 10SRE Observability: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10ema)
[13:35:15] (03CR) 10Elukey: [C: 04-1] kubernetes: add revscoring-editquality in the services configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey)
[13:35:25] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[13:36:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum2002.codfw.wmnet
[13:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:57] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[13:39:54] (03PS1) 10Dzahn: DHCP: add MAC addresses for durum2001 and durum2002 [puppet] - 10https://gerrit.wikimedia.org/r/720756 (https://phabricator.wikimedia.org/T290672)
[13:40:00] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum3001.esams.wmnet
[13:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:37] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC addresses for durum2001 and durum2002 [puppet] - 10https://gerrit.wikimedia.org/r/720756 (https://phabricator.wikimedia.org/T290672) (owner: 10Dzahn)
[13:50:31] o/ we're going to do the services DC switchover in ~10 minutes
[13:50:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum3001.esams.wmnet
[13:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:33] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum3002.esams.wmnet
[13:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:39] (03PS41) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641)
[13:53:53] PROBLEM - mediawiki-installation DSH group on mw1414 is CRITICAL: Host mw1414 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[13:54:58] (03PS1) 10Vgutierrez: api: Fix destination on symlinks to files [software/acme-chief] - 10https://gerrit.wikimedia.org/r/720759 (https://phabricator.wikimedia.org/T290249)
[13:55:15] ^ volans: that was your test host I think?
[13:55:38] yes mw1414 it's me, sorry about that moritzm
[13:55:45] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) Ready to create Ganeti VM durum2001.codfw.wmnet in the ganeti01.svc.codfw.wmnet cluster on row C with 2 vCPUs, 4GB of RAM, 15GB of disk in the private network. Read...
[13:55:46] I'll reimage it again in a few minutes
[13:58:45] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10akosiaris) >>! In T210704#7345278, @Jdforrester-WMF wrote: > Has Thumbor been upgraded, or is this waiting on {T21...
[13:59:22] (03PS1) 10Vgutierrez: Release 0.31 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/720761 (https://phabricator.wikimedia.org/T290249)
[14:00:05] Deploy window Switch Datacenter - Services (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210913T1400)
[14:00:05] Deploy window Switch Datacenter - MediaWiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210913T1400)
[14:00:19] 10SRE, 10Observability-Metrics, 10Upstream: Grafana error: "parse error at char 1: unexpected character: '\\ufeff'" when copy-pasting metric names - https://phabricator.wikimedia.org/T263624 (10Aklapper) [Fixed upstream](https://github.com/grafana/grafana/pull/39117); to be included in 8.2.0
[14:00:23] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10akosiaris) Ah, `3d2png` is also shipped with thumbor and that one is indeed nodejs. I now get the question, sorry...
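Back on the backtrace question above ("Is there a way to get a trace on PHP itself?"): debug_print_backtrace() only shows PHP-level frames, but a native, C-level stack of the running interpreter can be captured from outside with gdb's batch mode. A sketch — it assumes gdb is installed, ptrace permissions allow attaching, and ideally that php debug symbols are present on the host:

```python
#!/usr/bin/env python3
"""Dump a native (C-level) stack of a running PHP process with gdb,
complementing PHP's own debug_print_backtrace()."""
import subprocess
import sys


def native_backtrace(pid: int) -> str:
    # Attach, print every thread's backtrace, detach.
    return subprocess.run(
        ["gdb", "-batch", "-p", str(pid), "-ex", "thread apply all bt"],
        capture_output=True, text=True,
    ).stdout


if __name__ == "__main__":
    print(native_backtrace(int(sys.argv[1])))
```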
[14:02:23] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Volans) So, it seems that we got some misunderstanding about expectations connected to the switchdc here between the various people involved. @Marostegui @Kormat In order to try to make some... [14:03:16] args lgtm [14:03:26] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum3002.esams.wmnet [14:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:52] * mutante stops doing things on ganeti [14:04:03] +1 [14:04:09] okay then I'm going to start with 00-reduce-ttl-and-sleep [14:04:25] (re those two jouncebot windows, note that only the services window is right now, not sure why the bot pinged both) [14:04:41] +1 [14:04:43] (I think I copy-paste failed, I was going to look afterward) [14:05:19] jelto: go for it [14:05:22] !log jelto@cumin2002 START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep [14:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:35] legoktm: (ah yep, sorted in the right section but with the wrong date filled in -- fixed) [14:06:23] ty :) [14:06:39] 2021-09-13 14:06:10,366 jelto 970264 [INFO 00-reduce-ttl-and-sleep.py:26 in run] Yes, that is 5 minutes. Blame Joe. <-- welcome back joe! [14:07:11] \o/ [14:08:18] haha [14:08:22] spot-checked a couple of TTLs with dig, looks good [14:08:27] (03PS5) 10Rishabhbhat: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 (https://phabricator.wikimedia.org/T289752) [14:09:05] 10SRE, 10Wikifeeds, 10serviceops, 10Patch-For-Review: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10elukey) I have started https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-06_Wikifeeds [14:10:47] (03CR) 10Vgutierrez: [C: 03+2] api: Fix destination on symlinks to files [software/acme-chief] - 10https://gerrit.wikimedia.org/r/720759 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [14:10:52] (03CR) 10Vgutierrez: [C: 03+2] Release 0.31 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/720761 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [14:11:05] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) [14:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:13] I'm going to continue with 01-switch-dc, ping me to stop :) [14:12:18] +1 [14:12:24] hit it :D [14:12:31] !log jelto@cumin2002 START - Cookbook sre.switchdc.services.01-switch-dc [14:12:31] !log jelto@cumin2002 Switching services echostore, termbox, cxserver, eventstreams, search, ores, mathoid, schema, push-notifications, thanos-swift, wdqs, sessionstore, restbase, wdqs-internal, apertium, eventgate-analytics, citoid, api-gateway, restbase-async, proton, linkrecommendation, thanos-query, shellbox, kartotherian, mobileapps, recommendation-api, zotero, similar-users, shellbox-constraints, eventgate-logging-ex [14:12:31] ternal, eventgate-main, wikifeeds, eventstreams-internal, eventgate-analytics-external: codfw => eqiad [14:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:55] ^ oh yeah, we never fixed the second part of that line-length issue [14:13:00] 
10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff) Not sure why restbase is ticked off, though? The restbase hosts in production currently run nod... [14:13:13] at least we fixed the middle-truncation part [14:13:23] !log (cotd.) ternal, eventgate-main, wikifeeds, eventstreams-internal, eventgate-analytics-external: codfw => eqiad [14:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:27] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) [14:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:42] (03Merged) 10jenkins-bot: api: Fix destination on symlinks to files [software/acme-chief] - 10https://gerrit.wikimedia.org/r/720759 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [14:13:56] (03Merged) 10jenkins-bot: Release 0.31 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/720761 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [14:13:57] note that we do expect MW latency will go up a bit because service requests will be cross-DC [14:14:52] I see more load on eqiad kubernetes services and reduced load on codfw :) [14:14:59] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:15:16] ignore that [14:15:30] ha, I was about to say sessionstore in eqiad was mostly 404s, but it was mostly 404s in codfw too :) so it looks good [14:15:34] jayme: ack, thanks [14:17:02] * legoktm hasn't seen anything wrong yet [14:17:30] do we wait/check more metrics and logs before restoring the ttl? Or continue with 02-restore-ttl? [14:18:39] wait a bit for an all go from legoktm [14:18:47] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:19:23] I've been skimming all the services dashboards, they all seem fine to me [14:19:50] so +1 from me on restoring TTL [14:20:23] Ok I restore the ttl [14:20:45] rzl: looks good to you?
^ [14:20:45] !log jelto@cumin2002 START - Cookbook sre.switchdc.services.02-restore-ttl [14:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:12] (03PS1) 10Vgutierrez: api: Fix destination on symlinks to files [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720763 (https://phabricator.wikimedia.org/T290249) [14:21:14] (03PS1) 10Vgutierrez: Release 0.31 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720764 (https://phabricator.wikimedia.org/T290249) [14:21:16] (03PS1) 10Vgutierrez: debian: Add release 0.31 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720765 (https://phabricator.wikimedia.org/T290249) [14:21:29] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) [14:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:48] oops, wish we'd waited another sec :) I just noticed services are pooled in eqiad but depooled in codfw, instead of pooled in both [14:21:56] which is what we want since these are meant to be active/active [14:22:14] not a disaster, and fine to just repool without adjusting the TTL I think [14:22:51] but I wasn't thinking about how the services switchover isn't symmetrical the same way the MW switchover is -- for a switch*back* we want to do something different than just running the cookbook again [14:23:01] legoktm: check me? [14:23:27] which services? I thought it was intentional that services go from only codfw to only eqiad [14:23:46] the normal state for most of these is active/active, so pooled in both [14:24:15] isn't that last step something we do after the mediawiki switchover though ? [14:24:19] discovery.pool(args.dc_to) [14:24:19] discovery.depool(args.dc_from) [14:24:22] is what the cookbook does [14:24:32] well, probably not per policy, but just historically [14:25:20] that at least was the spark for writing disc_desired_state.py, cause we ended up with codfw indeed depooled [14:25:37] legoktm: right - last time I think we just did this with confctl, I think we've talked about adding a separate cookbook mode for it but there was some reason we didn't want to [14:25:51] (03CR) 10Vgutierrez: [C: 03+2] api: Fix destination on symlinks to files [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720763 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [14:25:54] (03CR) 10Vgutierrez: [C: 03+2] Release 0.31 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720764 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [14:25:55] akosiaris: that sounds plausible, I remember last time we talked about which way to do it but I forget what we decided [14:25:59] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.31 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720765 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [14:26:01] certainly believe you though [14:26:37] o/ [14:27:00] I also thought the same as lego.ktm, switching to eqiad only. Before the switch services traffic was served from codfw only from what I can see in the dashboards. [14:27:13] I'd need https://gerrit.wikimedia.org/r/c/operations/puppet/+/720251 to be shipped (known issue regarding wdqs updater and eventgate codfw->eqiad switch) [14:27:25] jelto: correct, that's what we do *during* the switchover [14:27:34] dcausse: I thought that was only needed tomorrow when MW switches?
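Doing the repool with confctl, as mentioned above, amounts to flipping the pooled flag on the dnsdisc (DNS discovery) objects for one site only. A minimal sketch, with the service list abridged for illustration -- the real selector covers all of the active/active services:

    # Repool the discovery records in codfw without touching eqiad; the selector
    # syntax mirrors what conftool itself logs in this channel.
    sudo confctl --object-type discovery select \
        'dnsdisc=(apertium|citoid|cxserver|mobileapps),name=codfw' set/pooled=true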
[14:27:34] now with the switchback we're restoring it to the normal state, which is active-active [14:28:01] legoktm: actually it's dependent on eventgate (destination topics for the job events) [14:28:31] jobs are now written to kafka-main@eqiad [14:28:42] (03Merged) 10jenkins-bot: api: Fix destination on symlinks to files [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720763 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [14:29:01] (03CR) 10Legoktm: [C: 03+2] Revert "[wdqs] switch updater reporting topic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/720251 (owner: 10DCausse) [14:29:09] legoktm: thanks! [14:29:10] (03Merged) 10jenkins-bot: Release 0.31 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720764 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [14:29:14] (03Merged) 10jenkins-bot: debian: Add release 0.31 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/720765 (https://phabricator.wikimedia.org/T290249) (owner: 10Vgutierrez) [14:29:55] (03PS1) 10Jcrespo: dbbackups: Remove s4 and s7 stretch backup source instances [puppet] - 10https://gerrit.wikimedia.org/r/720766 (https://phabricator.wikimedia.org/T288244) [14:29:58] (03PS1) 10Jcrespo: dbbackups: Reenable notifications on stretch hosts after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/720767 (https://phabricator.wikimedia.org/T288244) [14:30:02] I'll trigger puppet runs against wdqs[2001-2008].codfw.wmnet,wdqs[1003-1008,1010-1013].eqiad.wmnet [14:30:06] akosiaris: ohh, it's because last time we switched back MW and then switched back services [14:30:14] which is C:profile::query_service::updater [14:30:44] ah, indeed we had the inverse ofder [14:30:48] order* [14:31:29] dcausse: puppet finished [14:31:43] thanks! [14:33:20] do we have a list of which services are active/active and should be repooled in codfw? [14:33:57] afaik service::catalog has that info [14:34:00] it should be exactly the services acted on by that cookbook, but I'll double-check [14:34:32] yeah https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/services/__init__.py#31 [14:36:04] sorry for not thinking about this in advance :/ I think the current state makes as much sense as any, we could repool now or we could do it tomorrow after the MW switch, it doesn't make too much difference [14:38:02] !log restarting wdqs-updater.service on all wdqs servers [14:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:20] When repooling codfw services now I might need some assistance. I never pooled kubernetes services with conftool. 
I would try something link confctl select dc=codfw,name=kubernetes.* set/pooled=yes but I'm not sure :) [14:38:35] like* [14:38:41] https://phabricator.wikimedia.org/P17265 [14:39:26] I think we should just do it now, just keep it all self-contained to this window [14:39:30] legoktm: note that restbase gets some special treatment as described at https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Services [14:40:00] (03CR) 10Filippo Giunchedi: [C: 03+1] o11y: add logstash alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [14:40:11] yeah, and last time we decided to not do that and filed https://phabricator.wikimedia.org/T285711 for discussing [14:41:26] for the switch to codfw yeah, now that we're restoring it we should put it back to normal probably [14:41:44] which would be pooled in codfw, but not eqiad? [14:42:23] yeah, but only as of tomorrow I guess [14:43:01] well it was pooled in just codfw up until now, so restoring it to that state won't make a difference [14:43:14] I updated https://phabricator.wikimedia.org/P17265 with restbase-async at the bottom [14:43:21] okay cool [14:43:53] you might want to do it in the style of https://phabricator.wikimedia.org/P17266 instead -- same services, just done in one command [14:44:19] oh, much better [14:44:25] either works though [14:44:28] !log drained mx2001 mail queue to mx1001 T286911 [14:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:33] T286911: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 [14:44:48] jelto: do you want to do the honors? ^ see rzl's paste [14:45:02] legoktm: ok thanks for updating the paste, makes sense when I compare it with the wikitech page. So I first run the P17266 from rzl and after that only the last line of P17265, correct? [14:45:45] no, last two lines -- I'll update mine to include them [14:46:21] or I guess I'll update the first command to include restbase-async again [14:46:24] wait please [14:46:38] * jelto waits [14:47:03] okay, refresh https://phabricator.wikimedia.org/P17266 please, and legoktm check me? [14:47:59] lgtm [14:48:09] PROBLEM - puppet last run on deploy1002 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:48:35] ^ ignoring for now [14:48:52] ok then I'm going to pool services in codfw again, and after that depool restbase-async in eqiad [14:49:16] +1 [14:49:52] note this one will take up to 5 minutes to take effect, since we haven't restored the TTLs [14:50:23] ok. I will hit y and proceed, ping me to stop [14:50:36] I suppose we could do that first but I don't think it makes a huge difference [14:51:40] I don't think we need to, pretty unlikely we'll need to revert since we just came from codfw [14:51:43] PROBLEM - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss = 100% [14:52:03] ! [14:52:05] PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:52:17] what the... 
[14:52:19] jelto: wait please [14:52:32] ok [14:52:40] 2021-09-13 14:47:44 UTC Major FPC 1 Major Errors - MQSS Error code: 0x2204d6 [14:52:45] XioNoX: ^ [14:52:50] we didn't cause this but we can sort it out before sending more stuff to codfw, probably [14:52:51] cr2-codfw [14:53:35] RECOVERY - Host ms-be2045 is UP: PING OK - Packet loss = 0%, RTA = 31.54 ms [14:53:37] (03CR) 10BBlack: [C: 04-1] "LGTM overall, nice work! -1 for one security nitpick that should be easy to fix!" [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:54:14] topranks: ^ [14:54:28] chassis log on cr2-codfw is NOT happy [14:54:46] legoktm: as long as we're waiting for word on the router, we might as well shorten TTLs, wdyt? [14:55:03] yeah, good idea [14:55:09] we would just run the services cookbook again and use steps 0 and 2 for the TTLs, without running step 1 at all [14:55:10] * topranks looking [14:56:06] yeah I think FPC1 is fully offline [14:56:19] jelto: 👍 [14:56:20] topranks, akosiaris, it's the new unused linecard [14:56:32] I was about to see what uses it [14:56:34] https://phabricator.wikimedia.org/T271339 [14:56:36] rzl: so I rerun 00-reduce-ttl-and-sleep? and when we pooled codfw services again I run 02-... [14:56:46] +1 [14:56:50] yep, just so [14:56:53] !log jelto@cumin2002 START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep [14:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:51] hmm show chassis fpc says online, logs say offline [14:57:52] It shows as online in the status. [14:58:03] did it restart or something? [14:58:04] cmooney@re0.cr2-codfw> show chassis fpc pic-status 1 [14:58:04] Slot 1 Online MPC7E 3D MRATE-12xQSFPP-XGE-XLGE-CGE [14:58:04] PIC 0 Online MRATE-6xQSFPP-XGE-XLGE-CGE [14:58:04] PIC 1 Online MRATE-6xQSFPP-XGE-XLGE-CGE [14:58:13] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sda4.mount,srv-swift\x2dstorage-sdc1.mount,srv-swift\x2dstorage-sde1.mount,srv-swift\x2dstorage-sdf1.mount,srv-swift\x2dstorage-sdg1.mount,srv-swift\x2dstorage-sdh1.mount,srv-swift\x2dstorage-sdi1.mount,srv-swift\x2dstorage-sdj1.mount,srv-swift\x2dstorage-sdm1.mount,srv-swift\x2dstorage-sdn1.mount https://wikitech. [14:58:13] a.org/wiki/Monitoring/check_systemd_state [14:58:33] I'll take a look at ms-be2045 [14:58:44] just to make sure, ms-be2045 is related to cr2-codfw? or unrelated [14:58:56] * akosiaris not sure yet [14:59:24] probably not if it is unused, but isn't it too much of a coincidence? [14:59:38] yeah I'd say so, host plain rebooted by itself afaics [14:59:40] uptime on ms-be2045 is 7m [14:59:49] coincidence it is then [14:59:55] lol [14:59:55] RECOVERY - puppet last run on deploy1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:59:58] Emperor: thanks for that. [15:00:05] Deploy window Switch Datacenter - Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210913T1500) [15:00:21] godog knows what he's doing with swift, mostly ignore me :) [15:00:52] haha!
please listen to both [15:00:59] * godog shakes hands confidently [15:01:55] (03PS1) 10Volans: sre.experimental.reimage: improve conftool support [cookbooks] - 10https://gerrit.wikimedia.org/r/720770 [15:01:59] I'll move the ms-be discussion to -data-persistence FWIW to leave space for cr2 [15:02:08] godog: ack [15:02:18] !log Restarting unused line-card FPC 1 in cr2-codfw in attempt to clear alarm. [15:02:19] in the logs I see the fpc errors intertwined with some tftpd messages [15:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:38] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) [15:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:49] it tries to tftp logs and debug commands somewhere? [15:03:28] anyway, meeting, I am sure both topranks and xionox are more capable than me. [15:04:19] possibly... not something I've seen before but I'm new with the Juniper chassis routers. [15:04:22] * topranks investigating [15:04:44] Alarm has cleared following line card reset anyway. [15:05:23] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:05:36] topranks: no hurry at all, but let us know when you think we're all clear to proceed with repooling codfw for active-active services [15:06:35] (03PS5) 10Vgutierrez: cache::haproxy: Manage request/response headers [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) [15:07:23] PROBLEM - Disk space on ms-be2045 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda3 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2045&var-datasource=codfw+prometheus/ops [15:08:03] rzl: no probs, just gonna wait to see if the FPC settles down (still booting essentially), but should be good to go. [15:08:07] (03PS6) 10Vgutierrez: cache::haproxy: Manage request/response headers [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) [15:08:17] rad [15:08:30] As it was unused card it shouldn't affect what you are doing, but just to be safe / verify it's not impacting anything else let's give it a minute or two. [15:08:48] yep, for sure [15:09:11] I'm fairly happy there now, it's rebooted and chassis is not reporting any alarms. [15:09:45] So continue whenever you are ready. [15:10:22] thanks :) [15:10:23] thanks! [15:10:49] legoktm, jelto: all set?
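For reference, the line-card reset logged above is a routine Junos operation; a sketch of the commands involved on the cr2-codfw routing engine, with the slot number taken from the FPC 1 alarm:

    # Confirm the alarm, restart the card in slot 1, then re-check PIC status
    # once it has finished booting.
    show chassis alarms
    request chassis fpc slot 1 restart
    show chassis fpc pic-status 1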
[15:10:54] +1 from me [15:11:27] ok then I will pool services in codfw again [15:12:15] (03PS7) 10Vgutierrez: sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) [15:12:21] Ping me to stop :) [15:12:26] lgtm [15:12:31] !log jelto@cumin2002 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=(apertium|api-gateway|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventgate-main|eventstreams|eventstreams-internal|kartotherian|linkrecommendation|mathoid|mobileapps|ores|proton|push-notifications|recommendation-api|restbase|restbase-async|schema|search|sessionstore|shellbox|shell [15:12:32] box-constraints|similar-users|termbox|thanos-query|thanos-swift|wdqs|wdqs-internal|wikifeeds|zotero) [15:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:02] !log (contd.) box-constraints|similar-users|termbox|thanos-query|thanos-swift|wdqs|wdqs-internal|wikifeeds|zotero) [15:13:03] !log (cotd.) box-constraints|similar-users|termbox|thanos-query|thanos-swift|wdqs|wdqs-internal|wikifeeds|zotero) [15:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:06] heh [15:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:09] ahaha oops sorry [15:13:21] !log jelto@cumin2002 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=restbase-async [15:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:14] (03PS1) 10Dzahn: DHCP: add MAC addresses for durum3001 and durum3002 [puppet] - 10https://gerrit.wikimedia.org/r/720774 (https://phabricator.wikimedia.org/T290672) [15:15:17] hmm [15:15:25] https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&refresh=1m&var-dc=thanos&var-site=eqiad&var-service=sessionstore&var-prometheus=k8s&var-container_name=kask-production <-- looks like all traffic shifted to codfw? [15:15:41] yeah, that's what I'd expect -- all the traffic is coming *from* codfw [15:15:43] that's expected [15:15:45] right [15:16:01] so it'll prefer to stay there unless codfw couldn't handle the load [15:16:20] but tomorrow when MW switches, sessionstore will follow it cleanly [15:16:24] ok, this means that tomorrow we're effectively switching MW + service traffic in one go [15:16:39] having it in codfw will lower latencies [15:16:42] so +1 [15:16:57] this service is essentially in the hotpath of requests [15:17:09] (03PS2) 10Dzahn: DHCP: add MAC addresses for durum3001 and durum3002 [puppet] - 10https://gerrit.wikimedia.org/r/720774 (https://phabricator.wikimedia.org/T290672) [15:17:09] yeah, I can see the latency drop in the appserver dashboard already [15:17:21] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=9&orgId=1&refresh=1m&from=now-3h&to=now [15:17:41] (03CR) 10Vgutierrez: sslcert: Provide chained TLS cert with private key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:17:44] and "essentially" there was redundant. It's in the hot path of requests, period ;-) [15:18:40] everything still looking healthy afaict [15:18:45] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The following units failed: session-195785.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:06] (03CR) 10BBlack: [C: 03+1] "Thanks! 
Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:19:10] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC addresses for durum3001 and durum3002 [puppet] - 10https://gerrit.wikimedia.org/r/720774 (https://phabricator.wikimedia.org/T290672) (owner: 10Dzahn) [15:19:17] citoid is still mostly getting traffic in eqiad: https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=30s&var-dc=eqiad%20prometheus%2Fk8s&var-service=citoid [15:19:36] 10SRE-swift-storage: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10MatthewVernon) [15:20:19] !log rebooting ms-be2045 to see if that brings the disk back properly T290881 [15:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:23] (03PS1) 10Jelto: traffic: Depool codfw from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/720776 (https://phabricator.wikimedia.org/T287539) [15:20:24] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 [15:21:47] (03CR) 10Ssingh: [C: 03+1] traffic: Depool codfw from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/720776 (https://phabricator.wikimedia.org/T287539) (owner: 10Jelto) [15:21:50] (03CR) 10Muehlenhoff: [C: 03+2] Temporarily filter port 25 on mx2001 for reimage [homer/public] - 10https://gerrit.wikimedia.org/r/720277 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [15:22:02] (03PS2) 10Legoktm: traffic: Depool codfw from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/720776 (https://phabricator.wikimedia.org/T287539) (owner: 10Jelto) [15:22:52] (03CR) 10Legoktm: [C: 03+1] "(Just moved the Bug: footer to have no gap between it and Change-Id, otherwise Gerrit doesn't see it as a footer)" [dns] - 10https://gerrit.wikimedia.org/r/720776 (https://phabricator.wikimedia.org/T287539) (owner: 10Jelto) [15:23:15] rzl, jelto: does everything look good enough to restore TTL? 
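The TTL state can be spot-checked with dig, as was done earlier in the window; a minimal sketch, using one of the discovery records as an illustrative example:

    # Query an authoritative server directly so resolver caching doesn't hide the
    # real value; the second field of the answer line is the TTL in seconds.
    dig +noall +answer sessionstore.discovery.wmnet @ns0.wikimedia.org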
[15:23:23] PROBLEM - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:59] I'm happy, yeah [15:24:26] +1 [15:24:55] (03PS8) 10Vgutierrez: sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) [15:25:05] !log volans@cumin1001 START - Cookbook sre.experimental.reimage for host mw1414.eqiad.wmnet [15:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:10] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by volans@cumin1001 for host mw1414.eqiad.wmnet [15:25:17] then I'm going to restore the ttl for services [15:25:27] 👍 [15:25:29] +1 [15:25:37] !log jelto@cumin2002 START - Cookbook sre.switchdc.services.02-restore-ttl [15:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:21] !log jelto@cumin2002 END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) [15:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:07] (03CR) 10BBlack: [C: 03+1] sslcert: Provide chained TLS cert with private key [puppet] - 10https://gerrit.wikimedia.org/r/719044 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:27:12] (03CR) 10RLazarus: [C: 03+1] traffic: Depool codfw from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/720776 (https://phabricator.wikimedia.org/T287539) (owner: 10Jelto) [15:27:56] (03CR) 10Jelto: [C: 03+2] traffic: Depool codfw from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/720776 (https://phabricator.wikimedia.org/T287539) (owner: 10Jelto) [15:30:16] jelto: you'll need to manually !log running authdns-update from any authdns host, I tend to use authdns1001.wikimedia.org just because it's first [15:31:13] so I just run: root@authdns1001:~# authdns-update [15:31:15] ? [15:31:20] yes [15:31:31] it will prompt for confirmation of the diff, etc [15:31:45] (think of it like the authdns variant of "puppet-merge") [15:32:09] !log Traffic: depool codfw from user traffic [15:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:19] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) @ssingh durum2001, durum2002 and durum3001, durum3002 are ready to use, OS is installed but no role assigned. You can go ahead there. Other VM creations are in pro...
[15:32:44] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) [15:32:51] diff looks good for me: [15:32:52] +### v T287539 depool codfw for one week [15:32:52] +geoip/generic-map/codfw => DOWN [15:32:53] +### ^ T287539 depool codfw for one week [15:32:53] T287539: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 [15:33:12] +1 [15:33:42] also icinga will alert that there's a drop in codfw traffic, that's to be expected [15:33:55] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:05] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10ssingh) Thank you for all the help, @Dzahn! And yes, I will take it from there, thanks. [15:34:48] OK - authdns-update successful on all nodes! [15:35:36] this one's a ten-minute TTL so don't expect anything to happen on your graphs immediately :) [15:36:24] Yes I was expecting some TTL delay here :) [15:39:39] RECOVERY - Host ms-be2045 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [15:39:59] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sda4.mount,srv-swift\x2dstorage-sdd1.mount,srv-swift\x2dstorage-sdk1.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:02] !log upload acme-chief 0.31 to apt.wm.o (buster) - T290249 [15:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:07] T290249: Support OCSP stapling from prefetched responses in HAProxy - https://phabricator.wikimedia.org/T290249 [15:43:28] !log update acme-chief to version 0.31 on acmechief-test hosts - T290249 [15:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:35] 10SRE-swift-storage: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10MatthewVernon) On reboot, the disks came back, but many of the filesystems are unhappy: mvernon@ms-be2045:~$ sudo dmesg | grep 'Shutting down filesystem' [ 18.244602] XFS (sda3): Corruption of in-memory data det... [15:48:17] PROBLEM - Disk space on ms-be2045 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda3 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2045&var-datasource=codfw+prometheus/ops [15:48:25] seems like most of the traffic is switched [15:49:40] 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10MatthewVernon) a:03Papaul [15:50:15] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (10odimitrijevic) p:05Triage→03Medium a:03razzi [15:50:39] 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10MatthewVernon) Hi @Papaul this system seems to have had a hardware fault(s), and is (just) still within its warranty, could you get the hardware checked out, please? 
Thanks :) [15:51:02] legoktm: can confirm, codfw is receiving a lot less traffic [15:51:36] https://grafana.wikimedia.org/d/000000093/varnish-traffic?viewPanel=24&orgId=1&from=now-3h&to=now&refresh=1m [15:51:54] nice job today jelto :) [15:52:08] nice work everyone! [15:52:21] yep, very clean! [15:52:41] the only two notes I have are: 1) have the cookbook allow keeping the (now) passive DC pooled, 2) fix the !log line splitting [15:53:45] for (1) seems that a simple option might do wit [15:53:49] *do it [15:54:45] !log filtered mx2001 on the routers for reimage T286911 [15:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:49] T286911: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 [15:55:28] thanks for the support as well! [15:55:41] yeah, either something like `sre.switchdc.services --pool-both` to indicate what it does, or something like `sre.switchdc.services --switchback` to indicate why you'd want it [15:55:46] jelto: nice job! [15:58:45] (03CR) 10Razzi: [C: 03+2] analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric) [16:00:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10RKemper) @Jclark-ctr We put this ticket on hold while https://phabricator.wikimedia.org/T280203 was getting cl... [16:01:19] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:27] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/720672 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:04:33] (03PS1) 10Muehlenhoff: Revert "Temporarily filter port 25 on mx2001 for reimage" [homer/public] - 10https://gerrit.wikimedia.org/r/720783 (https://phabricator.wikimedia.org/T286911) [16:05:35] (03CR) 10MMandere: [C: 03+2] cloud: Add drmrs DC Site [puppet] - 10https://gerrit.wikimedia.org/r/720672 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:06:33] !log volans@cumin1001 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host mw1414.eqiad.wmnet [16:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:38] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - mw1414 (**PASS**) - Downtimed on Icinga - Depooled the following services from conf... 
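If note (1) above gets implemented, a switchback could then be expressed directly in the cookbook invocation. An illustrative sketch only -- neither proposed flag exists yet:

    # Hypothetical: pool the destination DC without depooling the source,
    # leaving the services active/active in both sites.
    sudo cookbook sre.switchdc.services.01-switch-dc --pool-both codfw eqiad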
[16:08:21] !log volans@cumin1001 conftool action : set/pooled=no; selector: name=mw1414.* [16:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:59] RECOVERY - Disk space on ms-be2045 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2045&var-datasource=codfw+prometheus/ops [16:12:51] PROBLEM - WDQS high update lag on wdqs1003 is CRITICAL: 3611 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:12:59] PROBLEM - WDQS high update lag on wdqs2005 is CRITICAL: 3619 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:13:01] PROBLEM - WDQS high update lag on wdqs1008 is CRITICAL: 3621 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:13:27] PROBLEM - WDQS high update lag on wdqs2008 is CRITICAL: 3647 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:13:51] PROBLEM - WDQS high update lag on wdqs2006 is CRITICAL: 3671 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:13:57] did we switch back eventgate to codfw? [16:14:21] PROBLEM - WDQS high update lag on wdqs1011 is CRITICAL: 3701 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:14:37] RECOVERY - mediawiki-installation DSH group on mw1414 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:14:38] rzl, legoktm, jelto ^^^ [16:14:59] dcausse: not on purpose but both dcs are active active now and discovery should choose codfw until tomorrow most of the time afaik [16:15:00] I see events in codfw.* topics [16:15:02] legoktm: sorry in a meeting, I can duck out if you need me to [16:16:00] !log volans@cumin1001 conftool action : set/pooled=yes; selector: name=mw1414.* [16:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:13] so actually on purpose, I just mean not codfw _only_ [16:16:41] Looking [16:17:34] dcausse: do you know which eventgate this is? [16:17:40] but eventgate was pointing at eqiad for a while (~14:00 CET to 15:15) and then codfw.* topics [16:17:43] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (10elukey) ` elukey@an-worker1096:~$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include components in optimal state) name: Adapter #0 Virtual Drive: 6 (Target...
[16:17:43] legoktm: should be main [16:18:02] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: Force creation of tmp files before restart [puppet] - 10https://gerrit.wikimedia.org/r/720667 (https://phabricator.wikimedia.org/T276198) (owner: 10DCausse) [16:18:06] !log legoktm@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=eventgate-main [16:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:03] sorry, it slipped my mind that even though it's technically active/active, we rely on it only being in one DC [16:20:19] dcausse: there's a 5 min DNS TTL btw [16:20:39] legoktm: ok, thanks! [16:21:38] WDQS lag alerts should resolve themselves on the first job hitting the eqiad.mediawiki.revision-create topic [16:21:51] RECOVERY - WDQS high update lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 31.8 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:22:03] RECOVERY - WDQS high update lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 41.05 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:22:03] RECOVERY - WDQS high update lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 43.2 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:23:47] RECOVERY - WDQS high update lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 39.64 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:25:09] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:09] RECOVERY - WDQS high update lag on wdqs1011 is OK: (C)3600 ge (W)1200 ge 44.71 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:29:25] (03PS2) 10Jcrespo: dbbackups: Remove s4 and s7 stretch backup source instances [puppet] - 10https://gerrit.wikimedia.org/r/720766 (https://phabricator.wikimedia.org/T288244) [16:29:49] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - search-psi-https_9643: Servers elastic1059.eqiad.wmnet, elastic1049.eqiad.wmnet, elastic1044.eqiad.wmnet, elastic1048.eqiad.wmnet, elastic1052.eqiad.wmnet, elastic1047.eqiad.wmnet, elastic1067.eqiad.wmnet, elastic1035.eqiad.wmnet, elastic1045.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:29:54] (03PS1) 10Volans: sre.experimental.reimage: check also Icinga status [cookbooks] - 10https://gerrit.wikimedia.org/r/720787 [16:30:31] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:31:16] (03CR) 10Elukey: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [16:31:17] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch
inactive shards 1591 threshold =0.2 breach: cluster_name: production-search-codfw, status: red, timed_out: False, number_of_nodes: 36, number_of_data_nodes: 36, active_primary_shards: 1291, active_shards: 2268, relocating_shards: 0, initializing_shards: 144, unassigned_shards: 1447, delayed_unassigned_shards: 0, number_o [16:31:17] g_tasks: 383, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 194008, active_shards_percent_as_number: 58.77170251360456 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:31:49] ouch ^ [16:32:31] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.4375 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:32:33] RECOVERY - WDQS high update lag on wdqs2008 is OK: (C)3600 ge (W)1200 ge 22.84 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:32:52] Seeing "search is too busy" on enwiki [16:33:01] out of meeting, looking now [16:33:21] I think all elasticsearch restarted on their own [16:33:25] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:33:42] did we do https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Days_in_advance_preparation_2 ? [16:34:03] (the elasticsearch cache pre-step) [16:34:26] legoktm: ^ [16:34:26] rzl: elasticsearch restarting is probably due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/720667 [16:34:28] https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?viewPanel=3&orgId=1&var-datasource=codfw%20prometheus%2Fops&from=1631540054176&to=1631550854176 [16:34:36] ack [16:34:41] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:35:20] dcausse, rzl I can confirm [16:35:34] all elasticsearch_6@production-search-eqiad units have an uptime of 6~8m [16:36:08] rzl: uh, no, I didn't. will do that in a minute, but it should only be needed for tomorrow [16:36:15] ryankemper: around --^ ? [16:36:19] 10SRE, 10Traffic, 10Patch-For-Review: Low root disk space on multiple eqsin cp nodes - https://phabricator.wikimedia.org/T290305 (10ema) 05Open→03Resolved a:03ema trafficserver-tls is not writing to local syslog anymore, and all eqsin hosts have at least ~ 30% available disk space now. Closing. 
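The 6-8 minute unit uptimes noted above are easy to confirm fleet-wide; a sketch using cumin, where the 'A:elastic-eqiad' host alias is illustrative:

    # ActiveEnterTimestamp values clustered within a few minutes of each other
    # across the fleet indicate a mass restart rather than organic churn.
    sudo cumin 'A:elastic-eqiad' \
        "systemctl show -p ActiveEnterTimestamp 'elasticsearch_6@production-search-eqiad.service'"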
[16:36:25] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: red, timed_out: False, number_of_nodes: 36, number_of_data_nodes: 36, active_primary_shards: 1292, active_shards: 3115, relocating_shards: 0, initializing_shards: 128, unassigned_shards: 616, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, numb [16:36:25] _flight_fetch: 0, task_max_waiting_in_queue_millis: 109, active_shards_percent_as_number: 80.72039388442602 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:36:31] we have to wait for recovery, not sure we can do much [16:36:43] legoktm: (fwiw I think search sre normally does it, just wasn't sure if the box was checked this time or not) [16:37:06] elukey: around, yeah this is related to recent merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/720667 [16:37:51] it seems that that restarted the whole fleet apparently within 3 minutes, was a puppet run forced? I would have expected at least a 30m window [16:37:55] (03CR) 10Ema: [C: 03+1] "Looks good, thanks!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) (owner: 10Herron) [16:38:00] yeah --^ [16:38:13] akosiaris: just checking, did you receive my mail okay? [16:38:18] hi btw, sorry :) [16:38:48] 10ops-eqiad, 10DC-Ops, 10serviceops: Q1: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10RobH) [16:39:06] 10ops-eqiad, 10DC-Ops, 10serviceops: Q1: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10RobH) [16:39:07] (03PS1) 10Addshore: Add yarn to node images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/720788 [16:39:24] volans: yeah I ran puppet on one host and then followed up with the rest of the fleet 6 hosts at a time [16:39:32] just realized I forgot to log it :x [16:39:58] I'm surprised it dropped into red status [16:40:02] checking if we have any unrecovered indices [16:40:11] I think that the problem is that it restarted elastic ;), not the missing log [16:40:24] volans: yes agreed :) but I should have logged anyway! [16:41:03] 10ops-eqiad, 10DC-Ops, 10serviceops: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10RobH) p:05Medium→03High [16:41:10] ryankemper: just to understand, the daemons were supposed to be restarted as part of the puppet run or is it a side-effect that came right afterwards? [16:41:44] I guess the former since the units were changed, checking the patch [16:41:44] elukey: I believe side effect of the puppet run due to the changed unit [16:42:42] ryankemper: ack so as follow up let's figure out a safe way to roll out these changes in the future, it seems that too many restarts in a short timeframe may lead to the red status (not a problem this time, it happens, I am thinking about fences for the next time if anybody else has to do it as well) [16:42:44] ryankemper: `production-search-codfw` is in yellow status now, ~84% recovered [16:43:06] elukey: I agree, I'll take that followup [16:43:13] thanks!
<3 [16:43:25] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:44:11] I think we might need an incident doc btw [16:44:41] `{"cluster_name":"production-search-eqiad","status":"yellow","timed_out":false,"number_of_nodes":35,"number_of_data_nodes":35,"active_primary_shards":1301,"active_shards":3318,"relocating_shards":0,"initializing_shards":116,"unassigned_shards":514,"delayed_unassigned_shards":0,"number_of_pending_tasks":32,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":997112,"active_shards_percent_as_number":84.04255319148936}` [16:44:46] just to understand, when a systemd unit changes, is the unit restarted automatically? [16:44:57] here's eqiad as well, in yellow status and also around 84% [16:45:01] Search is still broken on enwiki for me btw [16:45:27] dcausse: this is my confusion as well, I think there's certain changes where I haven't seen puppet restart the unit, but some that I have...will need to iron that out w followup [16:45:28] AntiComposite: thanks, it's recovering [16:45:35] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.375 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:45:39] api_appserver worker saturation is still high since ~16:29, no improvement yet https://grafana-rw.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=1631547923811&to=1631551523811 [16:46:38] dcausse: was puppet: Scheduling refresh of Exec[systemd daemon-reload for elasticsearch_6@.service] [16:46:39] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:46:44] for example: https://puppetboard.wikimedia.org/report/elastic1066.eqiad.wmnet/1e38352070c3913a7c9ecd9c94aa97549dcffd18 [16:47:46] I think we're about 15 minutes or so away from returning to green cluster status [16:48:07] I'm a little surprised we're still getting search too busy while in yellow status...presumably the cluster is just too busy moving shards around [16:48:10] dcausse,ryankemper yes by default when changing a systemd unit puppet restarts the service, unless instructed otherwise [16:48:27] my bad I did not know that :/ [16:50:15] it is fine! I think that we just need to put in place a workflow for the next time so that we rollout changes with a different pace (checking status of the canaries etc..)
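A paced rollout like the one proposed above is what cumin's batching options are for; a minimal sketch, again with an illustrative host alias:

    # Let the unit change (and the restart it triggers) reach a few hosts at a
    # time, pausing between batches to check cluster health before continuing.
    sudo cumin --batch-size 3 --batch-sleep 600 'A:elastic-eqiad' 'run-puppet-agent'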
[16:50:48] agreed [16:50:53] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.3281 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:51:22] (03PS4) 10Cwhite: o11y: add logstash alerts [alerts] - 10https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) [16:52:04] (03CR) 10Cwhite: o11y: add logstash alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [16:52:45] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07812 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:53:10] there we go, api servers are starting to recover finally [16:55:59] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [16:57:23] !log volans@cumin1001 START - Cookbook sre.experimental.reimage for host sretest1002.eqiad.wmnet [16:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:39] ACKNOWLEDGEMENT - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [2000.0] Ryan Kemper elasticsearch in yellow cluster status shards steadily recovering https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoot [16:58:31] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Remove s4 and s7 stretch backup source instances [puppet] - 10https://gerrit.wikimedia.org/r/720766 (https://phabricator.wikimedia.org/T288244) (owner: 10Jcrespo) [17:00:05] ryankemper: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210913T1700). [17:01:06] ^ Will save the wdqs deploy for the second half of this week when the switchover is done [17:01:11] ryankemper, dcausse: I think this merits an incident report [17:01:20] legoktm: I'll get a doc started [17:02:05] ryankemper: I think you can skip the wdqs deploy this week [17:02:12] thanks [17:02:13] ack [17:03:15] the api servers are still unhappy btw, that improvement earlier didn't really stick https://grafana-rw.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-1h&to=now [17:03:21] is there anything we can do there, or just wait? 
[17:03:39] the latency is very slowly going down [17:03:51] (expecting the answer is wait, I just want to make sure) [17:04:24] latency and 5xxs both yeah, but they're both fluctuating enough that it's hard to be sure [17:06:27] enwiki shards are being recovered this should make a big diff [17:09:26] nice, my enwiki search request is showing results now [17:10:27] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-codfw instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic=deprecated https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=loggi [17:10:27] &var-topic=All&var-consumer_group=All [17:14:41] PROBLEM - Long running screen/tmux on gitlab2001 is CRITICAL: CRIT: Long running tmux process. (user: root PID: 17829, 1735925s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [17:17:50] !log [Cirrus] `enwiki` searches appear to be working now. `production-search-eqiad` is at 93.5% recovered shards, `production-search-codfw` is at 95.3% recovered [17:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:03] first draft of the incident doc is _almost_ up [17:18:52] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10Jgreen) 05Resolved→03Open @Papaul I'm not able to log in with the expected password. Can you double-check the password? [17:20:42] !log volans@cumin1001 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1002.eqiad.wmnet [17:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:58] rzl: actually, looks like hardcoding the more_like was taken care of after the switch to codfw, see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/704390 [17:24:46] ah okay -- so the eqiad warmup can happen afterward at dcausse's leisure, I guess? 
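The recovery percentages in the !log entries above come straight from the Elasticsearch cluster health API, which can be polled directly against the endpoint named in the alerts:

    # 'status' (red/yellow/green) and 'active_shards_percent_as_number' are the
    # same figures quoted in the alerts and !log lines.
    curl -s 'https://search.svc.codfw.wmnet:9243/_cluster/health?pretty'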
[17:29:20] 10ops-eqiad, 10DC-Ops: Q2: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10RobH) [17:29:30] 10ops-eqiad, 10DC-Ops: Q2: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10RobH) [17:30:45] (03PS3) 10Bartosz Dziewoński: DiscussionTools: Make 'newtopictool' available to everyone on arwiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710076 (https://phabricator.wikimedia.org/T285724) [17:33:25] (03PS1) 10Volans: sre.experimental.reimage: print results to console [cookbooks] - 10https://gerrit.wikimedia.org/r/720793 [17:33:27] (03PS1) 10Volans: sre.experimental.reimage: improve unmask message [cookbooks] - 10https://gerrit.wikimedia.org/r/720794 [17:33:29] rzl: that's my understanding, yeah [17:33:48] sweet [17:39:00] (03PS1) 10Cwhite: logstash: filter high-volume ES deprecation message [puppet] - 10https://gerrit.wikimedia.org/r/720795 [17:41:55] (03PS2) 10Cwhite: logstash: filter high-volume ES deprecation message [puppet] - 10https://gerrit.wikimedia.org/r/720795 [17:42:10] (03CR) 10Cwhite: [V: 03+2 C: 03+2] logstash: filter high-volume ES deprecation message [puppet] - 10https://gerrit.wikimedia.org/r/720795 (owner: 10Cwhite) [17:48:01] I'm worried that we're still at high latency [17:48:43] !log [Cirrus] `eqiad` is at 99.13% shards recovered and `codfw` is at 98.83% [17:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:05] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (10Cmjohnson) A new disk has been ordered and will be here this week. You have successfully submitted request SR1070175430. [17:53:22] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) After **round 1** fixes, we run another set of 10k requests with and without xhprof. Results can be found here: https://peop... [17:54:13] mm error rate started climbing a bit now [17:54:29] https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?viewPanel=3&orgId=1&var-datasource=codfw%20prometheus%2Fops&from=1631544865739&to=1631555665739 [17:54:57] technically not the error rate, but you get the idea [17:55:45] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) [17:59:47] PROBLEM - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is CRITICAL: 348 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [18:00:01] Documentation for ~40 minute cirrussearch outage is up: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-13_cirrussearch_restart [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210913T1800). [18:00:05] No Gerrit patches in the queue for this window AFAICS. [18:04:09] ryankemper: thanks. is cirrussearch fully recovered now? we still have high latency on the MW side [18:05:06] I'm wondering if we just need a rolling restart [18:05:12] legoktm: 99.61% and 99.822% shards recovered for codfw and eqiad respectively. 
So it will be fully recovered shard-wise in like 2 minutes [18:05:16] cc rzl [18:05:20] legoktm: rolling restart of the appservers? [18:05:25] RECOVERY - MariaDB memory on dbstore1007 is OK: OK Memory 78% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:05:26] yeah [18:05:35] the api appservers specifically [18:05:40] okay good, was gonna say a rolling restart of cirrus is the last thing we need right now :D [18:05:51] !log sudo systemctl restart mariadb@s2.service [18:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:01] lemme take a look at some of those graps [18:06:03] graphs* [18:06:30] legoktm: maybe yeah, ideally I'd like to have some sense of why [18:06:40] I'm tailing a slowlog on mw2295 (picked at random), sometimes it's /srv/mediawiki/php-1.37.0-wmf.21/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php [18:06:42] (03PS2) 10Jforrester: Undeploy VipsScaler: I – Disable on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720355 (https://phabricator.wikimedia.org/T290759) [18:06:45] I wouldn't be surprised if there's some sort of buildup of jobs that are needing to get burned through now [18:07:08] So in a vacuum I would expect it to be a "wait and let it recover" scenario and that restarts wouldn't help, but obv I don't know much about what jobs these api appservers are actually doing [18:07:31] most of the Cirrus/Elastic traces are direct API requests, e.g. [0x00007ff73321d720] searchText() /srv/mediawiki/php-1.37.0-wmf.21/includes/api/ApiQuerySearch.php:97 [18:07:54] we could always restart one of em and see if it comes back happier, but I'm not sure what state we would be flushing out from the appserver's side [18:08:14] I'd like to try one before we set off the rolling restart, but other than that, no objections [18:08:43] 10SRE, 10MW-on-K8s, 10serviceops: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (10dancy) [18:08:45] PROBLEM - Disk space on ms-be2045 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2045&var-datasource=codfw+prometheus/ops [18:08:53] I'm not really sure either [18:09:55] Definitely agreed on starting with one if we go the restart route. I'd probably lean towards giving it ten minutes to see first (not that there would be any impact w/ one server, but just the point above about us not being certain what state there is for us to be flushing out) [18:10:14] ack [18:10:42] Also I'll need to add the downstream impacts on appserver api workers to the incident doc [18:11:08] legoktm: OK for me to deploy https://gerrit.wikimedia.org/r/720355 now?
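The slowlog traces mentioned above enter through ApiQuerySearch.php, i.e. ordinary action API search requests. A minimal sketch of exercising that same request path from the outside, using only the public MediaWiki action API; this is an illustration of the path being probed, not the exact check anyone ran during the incident:

```python
# Reproduce the request path seen in the traces: an action API search call,
# which MediaWiki routes through ApiQuerySearch -> searchText() -> Elastica.
import requests

def enwiki_search(term: str) -> list[str]:
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": term,
            "format": "json",
            "formatversion": "2",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return [hit["title"] for hit in resp.json()["query"]["search"]]

if __name__ == "__main__":
    # During the outage this call would hang or 5xx; healthy, it returns titles.
    print(enwiki_search("example query"))
```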
[18:11:09] ^ that was not related to the previous points, just noting it for my memory's sake [18:13:13] !log razzi@dbstore1007:~$ sudo systemctl restart mariadb@s3.service for T290841 [18:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:17] T290841: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 [18:14:31] James_F: let's wait until the current incident is over [18:14:40] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Jdforrester-WMF) [18:14:47] Fair. [18:16:03] Okay so there's a handful of `enwiki` shards still recovering on `codfw`, so that could explain the api appserver worker behavior [18:16:07] !log apply high log volume from ES mitigations to deprecated inputs [18:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:43] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Jdforrester-WMF) >>! In T210704#7348478, @MoritzMuehlenhoff wrote: > Not sure why restbase is ticked off, though?... [18:17:05] Well, if what they're requesting will be blocked on the recovery of those shards. I'm not very familiar with what work they're doing, if it's analogous to just an api request to the cirrus clusters then I'd expect that to just return a 5xx [18:17:06] ryankemper: thanks, that helps [18:18:53] These are all the shards left to recover on `codfw`. Note since we're in yellow cluster status that means any `UNASSIGNED` shards are replicas, not primaries (i.e. every index has at least one shard available to serve requests) https://www.irccloud.com/pastebin/Tpw6MP47/ [18:25:37] !log reenable replication on dbstore1007 for T290841 [18:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:42] T290841: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 [18:26:17] enwiki search had recovered for me, is now erroring again (also noticed by one other person in Discord) [18:26:47] ryankemper, dcausse: ^^ I can reproduce too [18:27:42] 10ops-eqiad, 10DC-Ops: Q2: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10RobH) [18:29:14] !log [Cirrus] `eqiad` fully recovered (100% of shards), `codfw` at 99.816%.
`codfw` is getting held up by recovery of `enwiki` shards which tend to be quite large [18:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:55] legoktm: AntiComposite: eqiad is fully recovered, so it must be attributable to the remaining codfw enwiki shards I posted above [18:30:11] There's not much (any) action we can actually take, but I'll poke around to make sure nothing else has gone wrong in the meantime [18:30:23] because MW is still in codfw, user traffic will end up there [18:30:35] (AIUI) [18:30:42] Makes sense to me [18:34:07] ryankemper, rzl: I have to step out for a minute to pick up lunch, I should be back in ~15m [18:34:20] ack, no worries [18:34:21] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 46.29 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:34:42] ^ all ok, we just had a bump 30 min ago [18:35:27] (03CR) 10Herron: "Kunal legoktm how might you approach installing NPM packages for something like this?" [puppet] - 10https://gerrit.wikimedia.org/r/720110 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:36:03] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [18:37:22] (03PS2) 10Jcrespo: dbbackups: Reenable notifications on stretch hosts after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/720767 (https://phabricator.wikimedia.org/T288244) [18:37:52] legoktm: AntiComposite: per https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?orgId=1, your searches should likely be working again [18:38:11] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:38:44] ryankemper: yep, appserver latency is back to normal, thank you! [18:39:51] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=10&orgId=1&from=1631547542314&to=1631558342314&var-datasource=codfw%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 [18:39:53] yup looks to be working here, probably have to bump that outage time on the incident report tho :) [18:40:03] this shows the duration of appserver impact for your report timeline [18:40:32] AntiComposite: indeed [18:40:35] rzl: thanks! 
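The point made at 18:18:53, that under yellow cluster status any UNASSIGNED shards are replicas rather than primaries, can be checked with Elasticsearch's `_cat/shards` API, whose `prirep` column reports `p` for a primary and `r` for a replica. A minimal sketch, again with a placeholder hostname:

```python
# List every shard that is not yet STARTED; under yellow cluster status the
# UNASSIGNED entries should all be replicas (prirep == "r"), so every index
# still has at least one shard able to serve requests.
import requests

def pending_shards(base_url: str) -> list[dict]:
    rows = requests.get(
        f"{base_url}/_cat/shards",
        params={"format": "json", "h": "index,shard,prirep,state,node"},
        timeout=10,
    ).json()
    return [row for row in rows if row["state"] != "STARTED"]

if __name__ == "__main__":
    # Placeholder endpoint; substitute the cluster actually being inspected.
    for row in pending_shards("http://search.example.internal:9200"):
        print(row)
```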
[18:42:51] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable notifications on stretch hosts after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/720767 (https://phabricator.wikimedia.org/T288244) (owner: 10Jcrespo) [18:43:35] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [18:51:27] 10SRE, 10NavigationTiming, 10Performance-Team: Switch to encrypted kafka for coal/navtiming/statsv - https://phabricator.wikimedia.org/T290131 (10Krinkle) [18:52:08] 10SRE, 10NavigationTiming, 10Performance-Team: Switch to encrypted kafka for coal/navtiming/statsv - https://phabricator.wikimedia.org/T290131 (10dpifke) a:03dpifke [18:52:43] 10SRE, 10Analytics, 10Data-Engineering, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10nettrom_WMF) Hi all, coming back to this as I've been OoO. As mentioned in T288853#7288436, having this data is... [18:52:51] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dancy) Still not working: https://foundation.wikimedia.org/static/current/skins/Timeless/resources/print.css [18:53:05] rzl: is it safe to do a deploy ATM? [18:54:19] urbanecm: yeah I was waiting a little longer just to make sure it stayed recovered, but I think we should be all clear -- I think James_F was waiting also, will let the two of you sort out ordering :) [18:54:44] thanks [18:55:40] (03PS2) 10Urbanecm: Add throttle rule for Czech wiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720458 (https://phabricator.wikimedia.org/T290809) [18:55:45] (03CR) 10Urbanecm: [C: 03+2] Add throttle rule for Czech wiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720458 (https://phabricator.wikimedia.org/T290809) (owner: 10Urbanecm) [18:55:53] I'll just start, it's a config patch, should be quick [18:56:55] (03Merged) 10jenkins-bot: Add throttle rule for Czech wiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720458 (https://phabricator.wikimedia.org/T290809) (owner: 10Urbanecm) [18:57:36] (03PS1) 10Ebernhardson: [WIP] query_service: Support proxying to microsite from backend [puppet] - 10https://gerrit.wikimedia.org/r/720801 [18:58:32] !log urbanecm@deploy1002 Synchronized wmf-config/throttle.php: 9db1d1ac938ca053c82fed88c8b6e75f97a52416: Add throttle rule for Czech wiki course (T290809) (duration: 00m 58s) [18:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:37] T290809: Add throttle rule for 2021-09-14 -- for Czech wiki course - https://phabricator.wikimedia.org/T290809 [18:59:05] !log [urbanecm@mwmaint2002 ~]$ mwscript resetAuthenticationThrottle.php --wiki={cswiki,cswikiversity} --signup --ip=185.47.223.49 # T290809 [18:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:11] * urbanecm done [18:59:13] James_F: clear for you [18:59:31] (03PS5) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [18:59:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 
'mwdebug' for release 'pinkunicorn' . [18:59:36] (03PS6) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [18:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:47] (03CR) 10jerkins-bot: [V: 04-1] Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [19:01:00] (03CR) 10jerkins-bot: [V: 04-1] Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [19:01:20] (03CR) 10Krinkle: Unset logo config rather than set to false (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 (owner: 10Jdlrobson) [19:04:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:16] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Niharika) >>! In T288844#7321649, @mepps wrote: > @Niharika Based on my read, it also looks like the 10 day delay would only be when th... [19:18:09] PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:24:54] PROBLEM - Check systemd state on an-worker1096 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:59] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10Jgreen) 05Open→03Resolved fixed now! [19:29:38] (03Abandoned) 10Ebernhardson: query_service: Remove parts of gui from backend servers [puppet] - 10https://gerrit.wikimedia.org/r/720070 (owner: 10Ebernhardson) [19:33:22] (03PS1) 10Cwhite: logstash: use full_message field to drop at log entry point [puppet] - 10https://gerrit.wikimedia.org/r/720808 [19:34:29] urbanecm: Thanks! [19:34:48] Reedy, legoktm: Should we deploy the VipsScaler disablement and see if anything breaks? [19:36:14] sounds good to me [19:37:38] (03PS2) 10Cwhite: logstash: use full_message field to drop at log entry point [puppet] - 10https://gerrit.wikimedia.org/r/720808 [19:37:44] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (10MRaishWMF) Hi @cmooney, thanks for noticing that. Yes, the 'mraish' account was set up when I was still contracting, and I set up the 'Mikeraish' w... [19:40:30] (03CR) 10Cwhite: [C: 03+2] logstash: use full_message field to drop at log entry point [puppet] - 10https://gerrit.wikimedia.org/r/720808 (owner: 10Cwhite) [19:43:06] Okie-doke.
[19:44:39] (03CR) 10Jforrester: [C: 03+2] Undeploy VipsScaler: I – Disable on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720355 (https://phabricator.wikimedia.org/T290759) (owner: 10Jforrester) [19:45:53] (03Merged) 10jenkins-bot: Undeploy VipsScaler: I – Disable on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720355 (https://phabricator.wikimedia.org/T290759) (owner: 10Jforrester) [19:46:54] (Live on mwdebug2001 if anyone wants to hunt for issues.) [19:47:13] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-9), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10ldelench_wmf) [19:47:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:50] Yeah, all looks OK from my debug testing on Commons, Wikisources, and Wikipedias. [19:49:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:22] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T290759 Undeploy VipsScaler: I – Disable on all wikis (duration: 00m 57s) [19:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:28] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [19:53:51] OK, if anyone sees any wikis on fire please shout. :-) [19:53:54] (Done.) [20:00:05] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210913T2000). [20:00:38] RECOVERY - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is OK: (C)210 ge (W)150 ge 100.9 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [20:05:16] (03CR) 10Legoktm: wip: logagent: puppet module sketch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720110 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [20:07:32] (03PS1) 10Jforrester: phabricator: Add 'In Progress' task status [puppet] - 10https://gerrit.wikimedia.org/r/720811 (https://phabricator.wikimedia.org/T288956) [20:17:30] (03CR) 10Herron: wip: logagent: puppet module sketch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720110 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [20:22:47] (03CR) 10BryanDavis: "Jbond pinged me as the jerk^Wperson who first introduced this to ops/puppet. I think that having any information on puppet runs in the ELK" [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [20:32:29] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10sbassett) >>! In T288844#7349692, @Niharika wrote: >>>! In T288844#7321649, @mepps wrote: >> @Niharika Based on my read, it also looks... 
[20:37:17] (03PS2) 10Ebernhardson: query_service: Support proxying to microsite from backend [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) [20:44:21] 10SRE, 10Deployments, 10Stashbot: [[wikitech:Server_admin_log]] should not rely on IRC for logmsgbot entries - https://phabricator.wikimedia.org/T46791 (10Legoktm) [20:55:11] 10SRE, 10Deployments, 10Stashbot: [[wikitech:Server_admin_log]] should not rely on IRC for logmsgbot entries - https://phabricator.wikimedia.org/T46791 (10Jdforrester-WMF) [21:00:05] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210913T2100). [21:01:15] ^ I still don't get that joke [21:02:36] the English idiom is "when all you have is a hammer, everything looks like a nail" as a statement about how we adapt the problem to fit our tools instead of the other way around [21:02:51] PHP is famous for being a hammer that excels at hitting your own thumb [21:04:58] rzl: so, mutatis mutandis, PHP is complex and bites you from time to time? [21:08:11] that's about the extent of it, yeah :) the original also contains a certain amount of dry humor that's tougher to translate faithfully while explaining, so you'll have to just imagine that it's seasoned to taste [21:08:58] rzl: I understand it better now, thanks :-) I'm not an English speaker, sometimes it's difficult :) [21:15:02] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:16:50] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:23:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Jclark-ctr) @RKemper Started working on this again today installing rails and racking. should move along qu... [21:25:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Jclark-ctr) [21:26:38] (03PS1) 10Legoktm: irc: Split long !log lines [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720816 (https://phabricator.wikimedia.org/T285709) [21:28:34] (03PS2) 10Legoktm: irc: Split long !log lines [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720816 (https://phabricator.wikimedia.org/T285709) [21:28:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Jclark-ctr) elastic1068 A4 elastic1069 A4 elastic1070 A6 elastic1071 A6 elastic1072 A6 elastic1073 A6 elastic1... [21:31:29] (03CR) 10jerkins-bot: [V: 04-1] irc: Split long !log lines [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720816 (https://phabricator.wikimedia.org/T285709) (owner: 10Legoktm) [21:35:57] (03CR) 10Legoktm: "One of the prospector errors is mine, the others are pre-existing." 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/720816 (https://phabricator.wikimedia.org/T285709) (owner: 10Legoktm) [21:40:12] (03PS1) 10Ahmon Dancy: Add tests to exercise uses of the php symlink in operations/mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/720817 (https://phabricator.wikimedia.org/T285298) [21:53:58] (03CR) 10RLazarus: [C: 03+1] "Thanks for doing this! Please do wait for Volans, since he had some opinions on T285709." [software/pywmflib] - 10https://gerrit.wikimedia.org/r/720816 (https://phabricator.wikimedia.org/T285709) (owner: 10Legoktm) [22:01:04] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:08:07] ebernhardson: Is the `es6` branch in mediawiki/vendor still needed? (you're the last committer, in 2019) [22:09:40] PROBLEM - ElasticSearch setting check - 9200 on elastic2025 is CRITICAL: CRITICAL - .search.remote.omega.seeds not found,.search.remote.psi.seeds not found https://wikitech.wikimedia.org/wiki/Search%23Administration [22:11:20] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dancy) Can someone give me an example of a curl command that exercises the /w/static.php cod... [22:16:43] 10SRE, 10observability: Tooling for end-of-quarter SLO reporting - https://phabricator.wikimedia.org/T290924 (10RLazarus) [22:23:30] PROBLEM - ElasticSearch setting check - 9400 on elastic1040 is CRITICAL: CRITICAL - .search.remote.chi.seeds not found,.search.remote.psi.seeds not found https://wikitech.wikimedia.org/wiki/Search%23Administration [22:23:32] PROBLEM - ElasticSearch setting check - 9600 on elastic1048 is CRITICAL: CRITICAL - .search.remote.chi.seeds not found,.search.remote.omega.seeds not found https://wikitech.wikimedia.org/wiki/Search%23Administration [22:35:08] PROBLEM - ElasticSearch setting check - 9200 on elastic2031 is CRITICAL: CRITICAL - .search.remote.omega.seeds not found,.search.remote.psi.seeds not found https://wikitech.wikimedia.org/wiki/Search%23Administration [22:35:20] PROBLEM - ElasticSearch setting check - 9600 on elastic1050 is CRITICAL: CRITICAL - .search.remote.chi.seeds not found,.search.remote.omega.seeds not found https://wikitech.wikimedia.org/wiki/Search%23Administration [22:37:50] PROBLEM - ElasticSearch setting check - 9400 on elastic1038 is CRITICAL: CRITICAL - .search.remote.chi.seeds not found,.search.remote.psi.seeds not found https://wikitech.wikimedia.org/wiki/Search%23Administration [22:38:44] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10wiki_willy) [22:46:14] PROBLEM - ElasticSearch setting check - 9400 on elastic1034 is CRITICAL: CRITICAL - .search.remote.chi.seeds not found,.search.remote.psi.seeds not found https://wikitech.wikimedia.org/wiki/Search%23Administration [23:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210913T2300). [23:00:04] No Gerrit patches in the queue for this window AFAICS. 
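The pywmflib change under review above ("irc: Split long !log lines", T285709) addresses IRC's 512-byte cap on a full protocol message (command, target, text, and framing), the same limit that truncated the kafka-consumer-lag Grafana URL into two messages at 17:10:27 earlier in this log. A minimal sketch of one way to split a long message; the payload budget of 400 bytes is an assumed safety margin, and none of this is taken from the actual patch:

```python
# Minimal sketch of splitting a long !log line for IRC. The 512-byte protocol
# cap covers command, target, text, and framing, so the usable payload is
# smaller; max_bytes=400 is an assumed conservative budget, not the value
# used by the real pywmflib implementation.
def split_irc_message(text: str, max_bytes: int = 400) -> list[str]:
    """Split text into chunks whose UTF-8 encoding fits within max_bytes,
    preferring to break at a space so words stay intact."""
    chunks = []
    while len(text.encode("utf-8")) > max_bytes:
        cut = min(len(text), max_bytes)
        # Back off until the encoded prefix fits the budget (multi-byte
        # characters can make a prefix's byte length exceed its char count).
        while cut > 1 and len(text[:cut].encode("utf-8")) > max_bytes:
            cut -= 1
        space = text.rfind(" ", 0, cut)
        if space > 0:
            cut = space  # break at the last space inside the budget
        chunks.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks

# Example: a long SAL entry goes out as several messages instead of being
# truncated mid-URL by the server.
for part in split_irc_message("!log [Cirrus] " + "shard recovery status " * 40):
    print(len(part.encode("utf-8")), part[:60])
```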
[23:07:58] PROBLEM - ElasticSearch setting check - 9200 on elastic2042 is CRITICAL: CRITICAL - .search.remote.omega.seeds not found,.search.remote.psi.seeds not found https://wikitech.wikimedia.org/wiki/Search%23Administration [23:31:38] PROBLEM - ElasticSearch setting check - 9600 on elastic1052 is CRITICAL: CRITICAL - .search.remote.chi.seeds not found,.search.remote.omega.seeds not found https://wikitech.wikimedia.org/wiki/Search%23Administration [23:32:39] (03PS1) 10Urbanecm: enwiki: Bump Growth features to 25% (mentorship limited to 20% of those users) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720825 (https://phabricator.wikimedia.org/T290927) [23:38:00] Oh, oops. [23:38:07] I was going to do a couple of prod config deploys. [23:38:09] All clear? [23:38:52] (03PS2) 10Jforrester: Undeploy VipsScaler: II – Don't load regardless of config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720356 (https://phabricator.wikimedia.org/T290759) [23:38:59] (03CR) 10Jforrester: [C: 03+2] Undeploy VipsScaler: II – Don't load regardless of config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720356 (https://phabricator.wikimedia.org/T290759) (owner: 10Jforrester) [23:39:45] (03Merged) 10jenkins-bot: Undeploy VipsScaler: II – Don't load regardless of config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720356 (https://phabricator.wikimedia.org/T290759) (owner: 10Jforrester) [23:41:14] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: T290759: Undeploy VipsScaler: II – Don't load regardless of config (duration: 00m 58s) [23:41:15] (03PS2) 10Jforrester: Undeploy VipsScaler: III – Don't set wmgUseVips, now ignored [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720357 (https://phabricator.wikimedia.org/T290759) [23:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:20] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [23:42:33] (03CR) 10Jforrester: [C: 03+2] Undeploy VipsScaler: III – Don't set wmgUseVips, now ignored [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720357 (https://phabricator.wikimedia.org/T290759) (owner: 10Jforrester) [23:42:38] (03PS2) 10Jforrester: Undeploy VipsScaler: IV – Don't load the i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720358 (https://phabricator.wikimedia.org/T290759) [23:43:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:46] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 75 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:43:57] (03Merged) 10jenkins-bot: Undeploy VipsScaler: III – Don't set wmgUseVips, now ignored [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720357 (https://phabricator.wikimedia.org/T290759) (owner: 10Jforrester) [23:44:56] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 92 probes of 495 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:44:59] (03CR) 10Jforrester: [C: 03+2] Undeploy VipsScaler: IV – Don't load the i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720358 (https://phabricator.wikimedia.org/T290759) (owner: 10Jforrester) [23:45:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:45:10] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 155 probes of 625 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:31] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T290759: Undeploy VipsScaler: III – Don't set wmgUseVips, now ignored (duration: 00m 58s) [23:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:46] (03Merged) 10jenkins-bot: Undeploy VipsScaler: IV – Don't load the i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720358 (https://phabricator.wikimedia.org/T290759) (owner: 10Jforrester) [23:46:42] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:49:36] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 42 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:50:32] OK, all clear. 
[23:50:48] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 53 probes of 495 (alerts on 65) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:51:02] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 42 probes of 625 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:51:13] 10SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (10Jdforrester-WMF) [23:51:34] 10SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (10Jdforrester-WMF) 05Stalled→03Open [23:52:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:16] 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Jdforrester-WMF) Ping again. Any SREer have the time to finish this? [23:54:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log