[00:00:33] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37262/" [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [00:00:45] (03PS6) 10Dzahn: gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [00:02:33] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/37263/" [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [00:05:48] (03CR) 10Dzahn: [C: 03+2] "thanks! noop confirmed on both servers, one by one" [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [00:06:01] (03PS5) 10Dzahn: gerrit: move jetty class to init [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [00:09:23] (03CR) 10Dzahn: "To be honest, I would have appreciated it if "move class around" would not have been mixed with all the other style fixes." [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [00:14:21] (03CR) 10Dzahn: "compiler output looks good to me except this one thing:" [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [00:18:29] (03CR) 10Dzahn: gerrit: move jetty class to init (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [00:28:54] (03CR) 10Dzahn: gerrit: move jetty class to init (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [00:30:49] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:32:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:33:08] (03CR) 10Dzahn: gerrit: move jetty class to init (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [00:35:03] (03PS6) 10Dzahn: gerrit: move jetty class to init [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [00:37:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:40:12] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37265/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [00:43:29] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed on both servers" [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [00:43:49] (03PS4) 10Dzahn: gerrit: change its templates to regular files [puppet] - 10https://gerrit.wikimedia.org/r/831963 (owner: 10Hashar) [00:45:40] (03CR) 10Dzahn: "There is a file called ".keep" in that directory. Merging this, with the purge parameter in there, would delete that. But the name implies" [puppet] - 10https://gerrit.wikimedia.org/r/831963 (owner: 10Hashar) [01:06:39] (03CR) 10Dzahn: [C: 03+2] gerrit: change its templates to regular files [puppet] - 10https://gerrit.wikimedia.org/r/831963 (owner: 10Hashar) [01:07:02] (03CR) 10Dzahn: [C: 03+2] "assuming the .keep file is because this once was a git repo in the past?" 
[puppet] - 10https://gerrit.wikimedia.org/r/831963 (owner: 10Hashar) [01:08:00] (03CR) 10Dzahn: [C: 03+2] "Notice: /Stage[main]/Gerrit/File[/var/lib/gerrit2/review_site/etc/its/templates/.keep]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/831963 (owner: 10Hashar) [01:09:31] (03CR) 10Dzahn: [C: 03+2] gerrit: modernize spec [puppet] - 10https://gerrit.wikimedia.org/r/832260 (owner: 10Hashar) [01:09:38] (03PS3) 10Dzahn: gerrit: modernize spec [puppet] - 10https://gerrit.wikimedia.org/r/832260 (owner: 10Hashar) [01:13:53] (03CR) 10Dzahn: [C: 03+2] gerrit: gerrit-theme.html is long gone [puppet] - 10https://gerrit.wikimedia.org/r/832343 (https://phabricator.wikimedia.org/T299877) (owner: 10Hashar) [01:13:59] (03PS2) 10Dzahn: gerrit: gerrit-theme.html is long gone [puppet] - 10https://gerrit.wikimedia.org/r/832343 (https://phabricator.wikimedia.org/T299877) (owner: 10Hashar) [01:20:31] (03CR) 10Dzahn: "this does not exist at that location or has already been removed:" [puppet] - 10https://gerrit.wikimedia.org/r/832344 (owner: 10Hashar) [01:21:35] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:21:40] (03CR) 10Dzahn: [C: 04-1] "Class[Gerrit]: expects a value for parameter 'daemon_user'" [puppet] - 10https://gerrit.wikimedia.org/r/832345 (owner: 10Hashar) [01:22:51] (03CR) 10Dzahn: "yea, so this would not make a difference but then we'd remove the code again in another change. 
if you are doing this for the devtools ins" [puppet] - 10https://gerrit.wikimedia.org/r/832344 (owner: 10Hashar) [01:24:41] (03CR) 10Dzahn: [C: 04-1] gerrit: remove unused mysql-connector-java lib [puppet] - 10https://gerrit.wikimedia.org/r/832344 (owner: 10Hashar) [01:27:16] (03CR) 10Dzahn: [C: 03+2] doc: Enable profile::auto_restarts::service for FPM [puppet] - 10https://gerrit.wikimedia.org/r/832253 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [01:31:18] (03Abandoned) 10Dzahn: webperf: add prometheus::blackbox::check::http for performance.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/823737 (https://phabricator.wikimedia.org/T277927) (owner: 10Dzahn) [01:33:10] (03CR) 10Dzahn: [C: 03+2] "I don't know how this actually gets merged." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/830909 (owner: 10Dduvall) [01:34:10] ./away [01:36:45] (JobUnavailable) firing: (2) Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:49] PROBLEM - Check systemd state on dbprov1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:51:45] 
(JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:06:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:16] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:39:40] RECOVERY - Check systemd state on dbprov1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:23] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) >>! In T317662#8237830, @Ladsgroup wrote: >>>! In T317662#8233007, @Marostegui wrote: >> Started mysql for now. Will do a data check but will leave the host depooled. > > I think mysql went down again.... [05:12:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1189.eqiad.wmnet with reason: down T317662 [05:12:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1189.eqiad.wmnet with reason: down T317662 [05:12:58] T317662: db1189 broken memory - https://phabricator.wikimedia.org/T317662 [05:13:39] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) @Jclark-ctr the host is powered off, you can change the memory when it arrives. Please leave it back ON when done. Thank you! 
[05:32:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1189.eqiad.wmnet with reason: down T317662 [05:32:13] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 7 days, 0:00:00 on db1189.eqiad.wmnet with reason: down T317662 [05:32:16] T317662: db1189 broken memory - https://phabricator.wikimedia.org/T317662 [05:41:54] RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2022-12-07 05:25:18 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/ [05:45:08] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:52:43] (03PS1) 10Marostegui: production-m3.sql.erb: Add grants [puppet] - 10https://gerrit.wikimedia.org/r/832389 (https://phabricator.wikimedia.org/T315713) [05:56:35] (03CR) 10Marostegui: [C: 03+2] production-m3.sql.erb: Add grants [puppet] - 10https://gerrit.wikimedia.org/r/832389 (https://phabricator.wikimedia.org/T315713) (owner: 10Marostegui) [05:57:46] PROBLEM - MegaRAID on an-worker1146 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:00:05] kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220915T0600). 
[06:02:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Codfw switchover s3 T317839 [06:02:19] T317839: Switchover s3 codfw master (db2105 -> db2127) - https://phabricator.wikimedia.org/T317839 [06:02:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Codfw switchover s3 T317839 [06:03:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2127 with weight 0 T317839', diff saved to https://phabricator.wikimedia.org/P34740 and previous config saved to /var/cache/conftool/dbconfig/20220915-060307-root.json [06:04:15] (03PS1) 10Marostegui: mariadb: Promote db2127 to s3 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/832390 (https://phabricator.wikimedia.org/T317839) [06:05:20] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2127 to s3 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/832390 (https://phabricator.wikimedia.org/T317839) (owner: 10Marostegui) [06:07:06] (03PS2) 10WMDE-Fisch: [beta] Add WMDE Technical Wishes QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832158 (https://phabricator.wikimedia.org/T316676) [06:08:09] (03PS1) 10Marostegui: db2105: Host needs to go under maintenance [puppet] - 10https://gerrit.wikimedia.org/r/832391 (https://phabricator.wikimedia.org/T317839) [06:09:25] (03CR) 10Marostegui: [C: 03+2] db2105: Host needs to go under maintenance [puppet] - 10https://gerrit.wikimedia.org/r/832391 (https://phabricator.wikimedia.org/T317839) (owner: 10Marostegui) [06:12:30] !log Starting s3 codfw failover from db2105 to db2127 - T317839 [06:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:34] T317839: Switchover s3 codfw master (db2105 -> db2127) - https://phabricator.wikimedia.org/T317839 [06:13:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2127 to s3 codfw T317839', diff saved to https://phabricator.wikimedia.org/P34741 and 
previous config saved to /var/cache/conftool/dbconfig/20220915-061317-marostegui.json [06:14:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2105 T317839', diff saved to https://phabricator.wikimedia.org/P34742 and previous config saved to /var/cache/conftool/dbconfig/20220915-061421-root.json [06:18:42] (03PS1) 10Marostegui: Revert "db2105: Host needs to go under maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/832316 [06:34:14] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832393 (https://phabricator.wikimedia.org/T317841) (owner: 10Awight) [06:35:11] (03CR) 10Marostegui: [C: 03+2] Revert "db2105: Host needs to go under maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/832316 (owner: 10Marostegui) [06:35:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 1%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34743 and previous config saved to /var/cache/conftool/dbconfig/20220915-063538-root.json [06:40:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T317842 [06:40:04] T317842: Switchover x1 codfw master (db2115 -> db2096) - https://phabricator.wikimedia.org/T317842 [06:40:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2096 with weight 0 T317842', diff saved to https://phabricator.wikimedia.org/P34744 and previous config saved to /var/cache/conftool/dbconfig/20220915-064014-root.json [06:40:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T317842 [06:41:38] (03PS1) 10Marostegui: mariadb: Promote db2096 to x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/832394 (https://phabricator.wikimedia.org/T317842) [06:43:08] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2096 to x1 codfw master [puppet] - 
10https://gerrit.wikimedia.org/r/832394 (https://phabricator.wikimedia.org/T317842) (owner: 10Marostegui) [06:44:18] (03PS1) 10Marostegui: db2115: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/832395 (https://phabricator.wikimedia.org/T317842) [06:44:22] !log Starting x1 codfw failover from db2115 to db2096 - T317842 [06:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2096 to x1 primary and set section read-write T317842', diff saved to https://phabricator.wikimedia.org/P34745 and previous config saved to /var/cache/conftool/dbconfig/20220915-064525-root.json [06:45:26] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:45:29] T317842: Switchover x1 codfw master (db2115 -> db2096) - https://phabricator.wikimedia.org/T317842 [06:46:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2115 T317842', diff saved to https://phabricator.wikimedia.org/P34746 and previous config saved to /var/cache/conftool/dbconfig/20220915-064635-marostegui.json [06:46:48] (03CR) 10Marostegui: [C: 03+2] db2115: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/832395 (https://phabricator.wikimedia.org/T317842) (owner: 10Marostegui) [06:47:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give some weight to db2096 T317842', diff saved to https://phabricator.wikimedia.org/P34747 and previous config saved to /var/cache/conftool/dbconfig/20220915-064750-marostegui.json [06:50:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 3%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34748 and previous config saved to /var/cache/conftool/dbconfig/20220915-065043-root.json [06:50:49] (03PS1) 10Marostegui: Revert "db2115: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/832317 [06:54:38] (03CR) 10Marostegui: [C: 03+2] Revert "db2115: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/832317 (owner: 10Marostegui) [06:55:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2115 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34749 and previous config saved to /var/cache/conftool/dbconfig/20220915-065510-root.json [06:55:58] (03CR) 10Awight: [beta] Add WMDE Technical Wishes QuickSurvey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832158 (https://phabricator.wikimedia.org/T316676) (owner: 10WMDE-Fisch) [06:58:44] (03PS3) 10Awight: [beta] Add WMDE Technical Wishes QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832158 (https://phabricator.wikimedia.org/T316676) (owner: 10WMDE-Fisch) [06:59:30] (03CR) 10Awight: [C: 03+2] "Deploying to the beta cluster." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832158 (https://phabricator.wikimedia.org/T316676) (owner: 10WMDE-Fisch) [07:00:05] Amir1, apergos, and jnuche: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220915T0700). [07:00:13] morning! we have a trainee signed up but no patches to deploy. hrm. don't go away though, folks, because there's a link to an as yet unlisted patch in their request for training. 
[07:00:13] (03Merged) 10jenkins-bot: [beta] Add WMDE Technical Wishes QuickSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832158 (https://phabricator.wikimedia.org/T316676) (owner: 10WMDE-Fisch) [07:05:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:05:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34750 and previous config saved to /var/cache/conftool/dbconfig/20220915-070548-root.json [07:06:51] the unlisted patch is being added to the deployment calendar and will be deployed in this morning's window. [07:06:59] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [07:09:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [07:10:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2115 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34751 and previous config saved to /var/cache/conftool/dbconfig/20220915-071015-root.json [07:11:20] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [07:12:05] (03PS1) 10Aqu: Make the 202208 snapshot available to AQS backend [puppet] - 10https://gerrit.wikimedia.org/r/832397 (https://phabricator.wikimedia.org/T317848) [07:12:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:12:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:13:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:13:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling 
restart_daemons on A:docker-registry [07:14:56] !log installing zlib security updates [07:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2151.codfw.wmnet with reason: reboot [07:17:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: reboot [07:18:48] tsepoThoabala: welcome, we'll be deploying your change today, now that it's on the calendar. [07:20:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34752 and previous config saved to /var/cache/conftool/dbconfig/20220915-072053-root.json [07:22:46] (03PS1) 10Muehlenhoff: apt_repo: Remove Apache2 leftovers [puppet] - 10https://gerrit.wikimedia.org/r/832398 [07:25:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2115 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34753 and previous config saved to /var/cache/conftool/dbconfig/20220915-072520-root.json [07:26:09] (03CR) 10Hashar: gerrit: ignore lint error in role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831932 (owner: 10Hashar) [07:28:10] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for nginx/aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/832399 (https://phabricator.wikimedia.org/T135991) [07:30:19] (03CR) 10Slyngshede: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/832398 (owner: 10Muehlenhoff) [07:33:07] (03CR) 10Muehlenhoff: [C: 03+2] apt_repo: Remove Apache2 leftovers [puppet] - 10https://gerrit.wikimedia.org/r/832398 (owner: 10Muehlenhoff) [07:33:21] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for nginx/installservers [puppet] - 10https://gerrit.wikimedia.org/r/832401 (https://phabricator.wikimedia.org/T135991) [07:34:47] 
(03CR) 10Hashar: "Given phpd is driven by systemd, if the process ever vanishes that would cause the unit to be flagged as failing and Icinga will report it" [puppet] - 10https://gerrit.wikimedia.org/r/832368 (https://phabricator.wikimedia.org/T315962) (owner: 10Dzahn) [07:34:59] (03CR) 10ArielGlenn: [V: 03+2] Enable action blocks on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832339 (https://phabricator.wikimedia.org/T317157) (owner: 10TsepoThoabala) [07:35:13] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] Enable action blocks on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832339 (https://phabricator.wikimedia.org/T317157) (owner: 10TsepoThoabala) [07:35:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34754 and previous config saved to /var/cache/conftool/dbconfig/20220915-073557-root.json [07:36:15] (03Merged) 10jenkins-bot: Enable action blocks on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832339 (https://phabricator.wikimedia.org/T317157) (owner: 10TsepoThoabala) [07:36:39] (03CR) 10Slyngshede: [C: 03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/832399 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:38:29] I'm going to pull this onto mwdebug1002 for testing [07:40:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2115 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34755 and previous config saved to /var/cache/conftool/dbconfig/20220915-074026-root.json [07:43:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:43:50] your change is now on mwdebug1002, please test [07:46:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on db[2135,2160].codfw.wmnet with reason: reboot [07:46:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db[2135,2160].codfw.wmnet with reason: reboot [07:46:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on db[2134,2160].codfw.wmnet with reason: reboot [07:46:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db[2134,2160].codfw.wmnet with reason: reboot [07:46:45] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for nginx/aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/832399 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:46:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on db[2133,2160].codfw.wmnet with reason: reboot [07:47:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db[2133,2160].codfw.wmnet with reason: reboot [07:47:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on db[2132,2160].codfw.wmnet with reason: reboot [07:47:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db[2132,2160].codfw.wmnet with reason: reboot [07:50:13] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:50:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:50:47] (03PS1) 10Muehlenhoff: Add separate logstash aliases for backend and collector nodes [puppet] - 10https://gerrit.wikimedia.org/r/832405 [07:51:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34756 and previous config saved to /var/cache/conftool/dbconfig/20220915-075102-root.json [07:53:51] testing is still in progress, we may run over a little bit in this window. [07:54:27] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:54:27] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:54:38] (03PS1) 10Hashar: gerrit: rm redundant service_params ensure => running [puppet] - 10https://gerrit.wikimedia.org/r/832411 [07:54:59] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:55:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2115 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34757 and previous config saved to /var/cache/conftool/dbconfig/20220915-075531-root.json [07:55:32] (03CR) 10CI reject: [V: 04-1] gerrit: rm redundant service_params ensure => running [puppet] - 10https://gerrit.wikimedia.org/r/832411 (owner: 10Hashar) [07:56:37] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:56:37] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:56:55] !log mwdebug-deploy@deploy1002 
helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:57:09] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [07:57:29] (03PS1) 10Hashar: gerrit: remove useless require => File[..] [puppet] - 10https://gerrit.wikimedia.org/r/832446 [07:57:51] (03CR) 10Hashar: gerrit: move jetty class to init (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [07:58:32] (03CR) 10CI reject: [V: 04-1] gerrit: remove useless require => File[..] [puppet] - 10https://gerrit.wikimedia.org/r/832446 (owner: 10Hashar) [07:59:08] (03PS1) 10Muehlenhoff: Add a cookbook to restart/reboot logstash collector/Kibana nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 [08:01:04] we're still not ready to fully deploy, please bear with us [08:01:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary codfw s6 T317850 [08:01:11] T317850: Switchover s6 codfw master (db2129 -> db2114) - https://phabricator.wikimedia.org/T317850 [08:01:12] maybe another 5-10 minutes [08:01:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2114 with weight 0 T317850', diff saved to https://phabricator.wikimedia.org/P34758 and previous config saved to /var/cache/conftool/dbconfig/20220915-080122-root.json [08:01:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary codfw s6 T317850 [08:01:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2114 from API T317850', diff saved to https://phabricator.wikimedia.org/P34759 and previous config saved to /var/cache/conftool/dbconfig/20220915-080157-root.json [08:02:49] (03CR) 10Muehlenhoff: gerrit: remove useless require => File[..] 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832446 (owner: 10Hashar) [08:02:51] (03PS1) 10Marostegui: mariadb: Promote db2114 to s6 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/832448 (https://phabricator.wikimedia.org/T317850) [08:03:02] (03CR) 10CI reject: [V: 04-1] Add a cookbook to restart/reboot logstash collector/Kibana nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [08:03:27] (03CR) 10Filippo Giunchedi: "Very cool! Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/832368 (https://phabricator.wikimedia.org/T315962) (owner: 10Dzahn) [08:04:06] (03PS2) 10Muehlenhoff: Add a cookbook to restart/reboot logstash collector/Kibana nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 [08:05:07] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2114 to s6 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/832448 (https://phabricator.wikimedia.org/T317850) (owner: 10Marostegui) [08:05:09] (03CR) 10Filippo Giunchedi: [C: 03+2] Don't check TODO runbooks for existence [alerts] - 10https://gerrit.wikimedia.org/r/832265 (owner: 10Filippo Giunchedi) [08:05:43] (03CR) 10Filippo Giunchedi: [C: 03+2] Add missing dashboard/runbook annotations as TODOs [alerts] - 10https://gerrit.wikimedia.org/r/832261 (owner: 10Filippo Giunchedi) [08:06:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34760 and previous config saved to /var/cache/conftool/dbconfig/20220915-080607-root.json [08:06:08] (03CR) 10Filippo Giunchedi: [C: 03+2] Require dashboard and runbook annotations [alerts] - 10https://gerrit.wikimedia.org/r/832262 (owner: 10Filippo Giunchedi) [08:10:25] buying another 5-10 minutes, heh [08:10:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2115 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34761 and previous config saved to 
/var/cache/conftool/dbconfig/20220915-081036-root.json [08:10:47] note to patch owners: make sure your user has the required permissions to test your patch, before adding it to the calendar :-) [08:12:42] (03PS1) 10Marostegui: db2129: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/832450 (https://phabricator.wikimedia.org/T317850) [08:13:40] (03CR) 10Marostegui: [C: 03+2] db2129: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/832450 (https://phabricator.wikimedia.org/T317850) (owner: 10Marostegui) [08:14:11] !log Starting s6 codfw failover from db2129 to db2114 - T317850 [08:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:15] T317850: Switchover s6 codfw master (db2129 -> db2114) - https://phabricator.wikimedia.org/T317850 [08:15:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2114 to s6 codfw master T317850', diff saved to https://phabricator.wikimedia.org/P34762 and previous config saved to /var/cache/conftool/dbconfig/20220915-081517-marostegui.json [08:15:53] (03PS3) 10Muehlenhoff: Add a cookbook to restart/reboot logstash collector/Kibana nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 [08:16:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2129 T317850', diff saved to https://phabricator.wikimedia.org/P34763 and previous config saved to /var/cache/conftool/dbconfig/20220915-081627-root.json [08:17:36] (03CR) 10Filippo Giunchedi: [C: 03+1] Add separate logstash aliases for backend and collector nodes [puppet] - 10https://gerrit.wikimedia.org/r/832405 (owner: 10Muehlenhoff) [08:19:40] (03CR) 10Btullis: [C: 03+2] Make the 202208 snapshot available to AQS backend [puppet] - 10https://gerrit.wikimedia.org/r/832397 (https://phabricator.wikimedia.org/T317848) (owner: 10Aqu) [08:20:07] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/832401 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:20:17] 
(03CR) 10Muehlenhoff: [C: 03+2] Add separate logstash aliases for backend and collector nodes [puppet] - 10https://gerrit.wikimedia.org/r/832405 (owner: 10Muehlenhoff) [08:20:22] aaannddd another 5-10 minutes (we are coordinating with someone else to do the testing, "it's complicated"(tm) [08:20:49] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for nginx/installservers [puppet] - 10https://gerrit.wikimedia.org/r/832401 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:21:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! See inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [08:21:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34764 and previous config saved to /var/cache/conftool/dbconfig/20220915-082112-root.json [08:21:29] btullis: please merge my patch along if puppet-merge prompts you for that [08:21:46] ah, no. the lock is actually gone now, merging myself [08:21:55] 10SRE, 10Wikimedia-Mailing-lists: Team Mailing List - https://phabricator.wikimedia.org/T317851 (10Peachey88) a:05Dzahn→03None De-assigning so who ever is on clinic can pick it up. [08:22:20] Thanks moritz. It didn't show up for me. 
[08:22:28] 10SRE, 10Wikimedia-Mailing-lists: Create Dagbani Wikimedians Usergroup mailing list - https://phabricator.wikimedia.org/T317851 (10Peachey88) [08:22:51] (03PS1) 10Marostegui: Revert "db2129: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/832321 [08:23:10] (03CR) 10Volans: [C: 04-1] "documentation/help message nit inline, LGTM beside that" [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [08:23:13] (03PS4) 10Muehlenhoff: Add a cookbook to restart/reboot logstash collector/Kibana nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 [08:23:26] (03CR) 10Muehlenhoff: Add a cookbook to restart/reboot logstash collector/Kibana nodes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [08:25:33] 10SRE, 10Wikimedia-Mailing-lists: Create Dagbani Wikimedians Usergroup mailing list - https://phabricator.wikimedia.org/T317851 (10Peachey88) This appears to already exist, per {T274582} and https://lists.wikimedia.org/postorius/lists/dagbani.lists.wikimedia.org/ Have you lost access since the MailMan 3 migra... [08:25:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2115 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34765 and previous config saved to /var/cache/conftool/dbconfig/20220915-082541-root.json [08:25:54] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10MoritzMuehlenhoff) >>! In T252807#8006300, @jbond wrote: > We should work out how much of this task is now covered by `SREBatchBase` Ack, I think this can... 
[08:26:03] !log tsepothoabala@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:832339|Enable action blocks on ptwiki (T317157)]] (duration: 04m 07s) [08:26:06] T317157: Activate action blocks on ptwiki - https://phabricator.wikimedia.org/T317157 [08:26:33] Ok, your patch is now live on the cluster, please test [08:27:23] 10SRE, 10Wikimedia-Mailing-lists: Create Dagbani Wikimedians Usergroup mailing list - https://phabricator.wikimedia.org/T317851 (10Peachey88) 05Open→03Stalled Stalled pending feedback [08:27:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10Volans) >>! In T252807#8006300, @jbond wrote: > We should work out how much of this task is now covered by `SREBatchBase` I agree, and also plan to move i... [08:28:15] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34766 and previous config saved to /var/cache/conftool/dbconfig/20220915-082851-root.json [08:28:54] (03CR) 10Marostegui: [C: 03+2] Revert "db2129: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/832321 (owner: 10Marostegui) [08:29:26] 10SRE, 10Wikimedia-Mailing-lists: Create Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (103245) 05Stalled→03Open [08:30:13] gehel: FYI ^^^ elastic2043 [08:30:43] volans: thanks! 
[08:31:01] 10SRE, 10Wikimedia-Mailing-lists: Create Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Peachey88) [08:32:03] 10SRE, 10Wikimedia-Mailing-lists: Create Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (103245) [08:34:05] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [08:34:23] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops-radar: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (10Clement_Goubert) FYI, the hosts that failed the poweroff step were powered down manually. They're all yours. [08:36:14] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10MoritzMuehlenhoff) >>! In T252807#8238518, @Volans wrote: >>>! In T252807#8006300, @jbond wrote: >> We should work out how much of this task is now covered... [08:37:31] (03PS1) 10David Caro: Fix logging message type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/832452 [08:38:02] (03CR) 10David Caro: bootstrap_and_add: added preflight checks (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: 10David Caro) [08:38:22] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 2 others: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10MoritzMuehlenhoff) [08:38:38] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10MoritzMuehlenhoff) 05Open→03Resolved a:03jbond But closing the task since we're all in agreement that SREBatchBase solved this. 
[08:40:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2115 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34767 and previous config saved to /var/cache/conftool/dbconfig/20220915-084046-root.json [08:43:45] !log about to deploy analytics/refinery [08:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 3%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34768 and previous config saved to /var/cache/conftool/dbconfig/20220915-084355-root.json [08:44:53] (03PS2) 10David Caro: Fix logging message typo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/832452 [08:45:16] (03CR) 10David Caro: "There was a typo on the word typo in the commit message of the change fixing a typo xd" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/832452 (owner: 10David Caro) [08:45:18] !log aqu@deploy1002 Started deploy [analytics/refinery@278c383]: Regular analytics weekly train [analytics/refinery@278c383] [08:46:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [08:47:28] (03CR) 10FNegri: [C: 03+2] Fix logging message typo (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/832452 (owner: 10David Caro) [08:47:55] (03CR) 10Hashar: gerrit: remove useless require => File[..] (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/832446 (owner: 10Hashar) [08:48:02] (03PS2) 10Hashar: gerrit: remove useless require => File[..] 
[puppet] - 10https://gerrit.wikimedia.org/r/832446 [08:48:19] (03PS5) 10Muehlenhoff: Add a cookbook to restart/reboot logstash collector/Kibana nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 [08:49:11] (03PS2) 10Hashar: gerrit: rm redundant service_params ensure => running [puppet] - 10https://gerrit.wikimedia.org/r/832411 [08:49:22] !log UTC backport training window closed at last [08:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:38] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10Volans) Ack, no problem to wait. As for the analysis, AFAICT the only real thing missing is: * support for % batch size instead of integers (we could reus... [08:50:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, ping me when it's a good time and I can merge this" [puppet] - 10https://gerrit.wikimedia.org/r/832446 (owner: 10Hashar) [08:50:50] (03PS1) 10JMeybohm: Allow to silence "error handling stats line" messages [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/832453 (https://phabricator.wikimedia.org/T289766) [08:51:06] (03Merged) 10jenkins-bot: Fix logging message typo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/832452 (owner: 10David Caro) [08:53:15] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/832278 (owner: 10Muehlenhoff) [08:54:15] (03CR) 10David Caro: "One comment, otherwise looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/832355 (owner: 10Andrew Bogott) [08:54:16] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10MoritzMuehlenhoff) >>! In T252807#8238563, @Volans wrote: > As for the analysis, AFAICT the only real thing missing is: > * support for % batch size instea... 
[08:54:23] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/832453 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [08:57:50] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:58:09] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10Volans) The conversion of existing ones was there, there is a list of 4 cookbooks :) but no problem to move it elsewhere. [08:59:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34770 and previous config saved to /var/cache/conftool/dbconfig/20220915-085900-root.json [09:00:03] (03CR) 10Muehlenhoff: [C: 03+2] gerrit: remove useless require => File[..] [puppet] - 10https://gerrit.wikimedia.org/r/832446 (owner: 10Hashar) [09:00:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/832411 (owner: 10Hashar) [09:01:46] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10MoritzMuehlenhoff) >>! In T252807#8238574, @Volans wrote: > The conversion of existing ones was there, there is a list of 4 cookbooks :) but no problem to... [09:02:22] that `gerrit` puppet module will eventually look cleaner, thanks for all the reviews :] [09:07:24] (03CR) 10Filippo Giunchedi: "Overall LGTM, technically we're using "opensearch dashboards" now (not kibana) but I'm not feeling strongly one way or another. 
What do ot" [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [09:07:46] (03PS2) 10JMeybohm: Allow to silence "error handling stats line" messages [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/832453 (https://phabricator.wikimedia.org/T289766) [09:08:00] (03CR) 10JMeybohm: Allow to silence "error handling stats line" messages (031 comment) [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/832453 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [09:11:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/832453 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [09:11:47] (03CR) 10Filippo Giunchedi: [C: 03+1] sre: followup on Kafka partition replication alerts [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [09:11:49] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: followup on Kafka partition replication alerts [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [09:11:51] (03CR) 10WMDE-Fisch: [beta] Add WMDE Technical Wishes QuickSurvey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832158 (https://phabricator.wikimedia.org/T316676) (owner: 10WMDE-Fisch) [09:11:54] (03PS4) 10Filippo Giunchedi: sre: followup on Kafka partition replication alerts [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) [09:12:50] !log aqu@deploy1002 Finished deploy [analytics/refinery@278c383]: Regular analytics weekly train [analytics/refinery@278c383] (duration: 27m 31s) [09:13:25] (03CR) 10Filippo Giunchedi: "I know we're missing druid ZK (pending puppet changes discussed in comments). Ok to go ahead with this change ?" 
[alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:13:28] !log aqu@deploy1002 Started deploy [analytics/refinery@278c383] (thin): Regular analytics weekly train THIN [analytics/refinery@278c383] [09:13:37] !log aqu@deploy1002 Finished deploy [analytics/refinery@278c383] (thin): Regular analytics weekly train THIN [analytics/refinery@278c383] (duration: 00m 08s) [09:13:59] (03PS6) 10Muehlenhoff: Add a cookbook to restart/reboot logstash collector/Kibana nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 [09:14:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34772 and previous config saved to /var/cache/conftool/dbconfig/20220915-091405-root.json [09:23:56] !log aqu@deploy1002 Started deploy [analytics/refinery@278c383] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@278c383] [09:26:35] (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:29:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34773 and previous config saved to /var/cache/conftool/dbconfig/20220915-092910-root.json [09:31:35] (KafkaUnderReplicatedPartitions) firing: (6) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:33:12] mmmmm [09:33:59] the jumbo cluster seems fine, no under replicated partitions (afaics from grafana) [09:34:50] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Migrate existing 
cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855 (10MoritzMuehlenhoff) [09:35:31] (03CR) 10JMeybohm: [C: 03+2] Allow to silence "error handling stats line" messages [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/832453 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [09:36:57] checked also in thanos that max(kafka_server_ReplicaManager_UnderReplicatedPartitions) by (kafka_cluster) is zero [09:37:21] godog: o/ there is a weird thing, the KafkaUnderReplicatedPartitions shouldn't be firing (in theory) [09:37:26] but I am probably missing something [09:38:18] !log aqu@deploy1002 Finished deploy [analytics/refinery@278c383] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@278c383] (duration: 14m 21s) [09:43:20] elukey: mmhh checking, thanks for the heads up [09:44:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34774 and previous config saved to /var/cache/conftool/dbconfig/20220915-094415-root.json [09:46:02] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:46:52] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:47:30] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:47:55] elukey: doh of course I got the comparison wrong [09:48:39] (03PS1) 10Filippo Giunchedi: sre: fix KafkaUnderReplicatedPartitions comparison [alerts] - 10https://gerrit.wikimedia.org/r/832458 (https://phabricator.wikimedia.org/T309010) [09:48:43] elukey: ^ [09:50:14] 10SRE, 10Wikimedia-Mailing-lists: Create Dagbani Wikimedians Usergroup Team mailing list - 
https://phabricator.wikimedia.org/T317851 (10Masssly) >>! In T317851#8238507, @Peachey88 wrote: > This appears to already exist, per {T274582} and https://lists.wikimedia.org/postorius/lists/dagbani.lists.wikimedia.org/... [09:50:20] PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:51:22] 10SRE, 10Wikimedia-Mailing-lists: Create Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Peachey88) >>! In T317851#8238655, @Masssly wrote: > The already created public list above was for the entirety of the community. This new ticket is requesting a closed group... [09:51:37] (03PS1) 10Jelto: buildkitd: add option to enable proxy settings for buildkitd [puppet] - 10https://gerrit.wikimedia.org/r/832460 (https://phabricator.wikimedia.org/T308271) [09:53:28] (03CR) 10Elukey: [C: 03+1] sre: fix KafkaUnderReplicatedPartitions comparison [alerts] - 10https://gerrit.wikimedia.org/r/832458 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [09:55:12] godog: thanks! [09:56:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:57:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [09:57:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[09:57:47] (03CR) 10Muehlenhoff: Add a cookbook to restart/reboot logstash collector/Kibana nodes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [09:58:12] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: fix KafkaUnderReplicatedPartitions comparison [alerts] - 10https://gerrit.wikimedia.org/r/832458 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [09:58:14] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:58:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:58:46] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:59:16] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [09:59:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34775 and previous config saved to /var/cache/conftool/dbconfig/20220915-095920-root.json [09:59:21] (03PS1) 10Sergio Gimeno: Mentee overview: avoid requiring the non-vue mentee overview script when loading the Vue one [extensions/GrowthExperiments] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/832462 (https://phabricator.wikimedia.org/T300532) [09:59:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:00:05] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220915T1000). [10:02:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2131', diff saved to https://phabricator.wikimedia.org/P34777 and previous config saved to /var/cache/conftool/dbconfig/20220915-100212-root.json [10:02:43] (03CR) 10Muehlenhoff: [C: 03+2] Align includes with current practice [puppet] - 10https://gerrit.wikimedia.org/r/832278 (owner: 10Muehlenhoff) [10:02:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:03:07] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10HasanAkgun_WMDE) Hello, I just signed the agreement. [10:03:11] (03PS2) 10Sergio Gimeno: Enable the Vue version of the mentee overview in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830197 (https://phabricator.wikimedia.org/T300532) [10:03:27] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/814820 (owner: 10PipelineBot) [10:03:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on db2131.codfw.wmnet with reason: reboot [10:04:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2131.codfw.wmnet with reason: reboot [10:04:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:05:39] 10SRE, 10LDAP-Access-Requests: Request for 
changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (10HasanAkgun_WMDE) @Dzahn no I realize that it's way more complicated than I think, so let's cancel the task and pretend like it never happened? [10:07:37] (03PS3) 10Sergio Gimeno: Enable the Vue version of the mentee overview in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830197 (https://phabricator.wikimedia.org/T300532) [10:07:53] (03CR) 10Sergio Gimeno: Enable the Vue version of the mentee overview in pilot wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830197 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [10:07:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:12:21] (03CR) 10Filippo Giunchedi: Add a cookbook to restart/reboot logstash collector/Kibana nodes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [10:14:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34778 and previous config saved to /var/cache/conftool/dbconfig/20220915-101425-root.json [10:14:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 1%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34779 and previous config saved to /var/cache/conftool/dbconfig/20220915-101438-root.json [10:16:35] (KafkaUnderReplicatedPartitions) firing: (6) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:19:52] 
(03CR) 10CI reject: [V: 04-1] Mentee overview: avoid requiring the non-vue mentee overview script when loading the Vue one [extensions/GrowthExperiments] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/832462 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [10:19:58] ^ that should resolve [10:21:35] (KafkaUnderReplicatedPartitions) resolved: (6) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:23:20] (03CR) 10Awight: [C: 03+2] [beta] Add WMDE Technical Wishes QuickSurvey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832158 (https://phabricator.wikimedia.org/T316676) (owner: 10WMDE-Fisch) [10:29:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 3%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34780 and previous config saved to /var/cache/conftool/dbconfig/20220915-102943-root.json [10:32:56] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832464 (https://phabricator.wikimedia.org/T317862) (owner: 10Awight) [10:33:08] (03CR) 10Volans: [C: 03+1] "LGTM framework side, I'll leave it to o11y for the details of the service." [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [10:36:34] (03PS1) 10Muehlenhoff: planet: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/832466 (https://phabricator.wikimedia.org/T135991) [10:36:42] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[10:39:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:43:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:44:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34781 and previous config saved to /var/cache/conftool/dbconfig/20220915-104448-root.json [10:44:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:49:51] (03PS2) 10Hashar: gerrit: remove unused mysql-connector-java lib [puppet] - 10https://gerrit.wikimedia.org/r/832344 [10:50:08] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:50:52] (03CR) 10Hashar: "I got the path wrong, the jar files are under $GERRIT_SITE/lib." 
[puppet] - 10https://gerrit.wikimedia.org/r/832344 (owner: 10Hashar) [10:52:50] (03CR) 10Btullis: [C: 03+1] sre: port Zookeeper alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:56:09] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/832468 (https://phabricator.wikimedia.org/T135991) [10:59:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34782 and previous config saved to /var/cache/conftool/dbconfig/20220915-105953-root.json [11:00:08] (03PS1) 10Jbond: spec_helper: include the monkey patch for the actual spec tests [puppet] - 10https://gerrit.wikimedia.org/r/832469 [11:00:27] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Aklapper) [11:04:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'pool db2129 into s6 API', diff saved to https://phabricator.wikimedia.org/P34783 and previous config saved to /var/cache/conftool/dbconfig/20220915-110453-root.json [11:07:49] (03CR) 10Hashar: gerrit: change its templates to regular files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831963 (owner: 10Hashar) [11:10:00] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on sessionstore1001.eqiad.wmnet with reason: Testing reimage [11:10:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sessionstore1001.eqiad.wmnet with reason: Testing reimage [11:12:22] !log sessionstore1001: c-foreach-nt drain [11:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to 
https://phabricator.wikimedia.org/P34784 and previous config saved to /var/cache/conftool/dbconfig/20220915-111458-root.json [11:15:33] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [11:16:39] (03CR) 10Hashar: "That should be a require_relative ;)" [puppet] - 10https://gerrit.wikimedia.org/r/832469 (owner: 10Jbond) [11:17:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public [11:22:09] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1001.eqiad.wmnet with OS buster [11:22:30] !log importing prometheus-rsyslog-exporter 0.0.0+git20201008-4 to stretch-wikimedia, buster-wikimedia, bullseye-wikimedia - T289766 [11:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:33] T289766: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 [11:25:30] PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:18] (03PS1) 10Ladsgroup: rdbms: Allow SubQuery objects in SelectQueryBuilder as table [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/832322 (https://phabricator.wikimedia.org/T314189) [11:30:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34785 and previous config saved to /var/cache/conftool/dbconfig/20220915-113003-root.json [11:35:59] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1059 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:38:21] (03PS1) 10Hnowlan: install_server: correct pathing for sessionstore 
mapper [puppet] - 10https://gerrit.wikimedia.org/r/832481 (https://phabricator.wikimedia.org/T303833) [11:40:37] (03PS1) 10Muehlenhoff: wikikube-etcd alias: Also include staging hosts [puppet] - 10https://gerrit.wikimedia.org/r/832482 [11:41:09] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: port Zookeeper alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:41:15] (03PS8) 10Filippo Giunchedi: sre: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) [11:41:19] (03CR) 10Jcrespo: [C: 03+1] install_server: correct pathing for sessionstore mapper [puppet] - 10https://gerrit.wikimedia.org/r/832481 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan) [11:41:41] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/832481 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan) [11:42:32] (03CR) 10Hnowlan: [C: 03+2] install_server: correct pathing for sessionstore mapper [puppet] - 10https://gerrit.wikimedia.org/r/832481 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan) [11:42:37] (03CR) 10Filippo Giunchedi: [V: 03+2] sre: port Zookeeper alerts [alerts] - 10https://gerrit.wikimedia.org/r/818402 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [11:43:17] !log restart exim on lists1001 to pick up zlib security updates [11:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:25] !log hnowlan@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore1001.eqiad.wmnet with OS buster [11:45:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34786 and previous config saved to /var/cache/conftool/dbconfig/20220915-114508-root.json [11:45:16] !log 
hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1001.eqiad.wmnet with OS buster [11:48:28] (03PS1) 10Muehlenhoff: matomo/piwik: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/832483 (https://phabricator.wikimedia.org/T135991) [11:50:17] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [11:50:46] !log hnowlan@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore1001.eqiad.wmnet with OS buster [11:51:01] RECOVERY - Check systemd state on ms-be1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:21] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore1001.eqiad.wmnet with OS buster [11:52:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:00:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34787 and previous config saved to /var/cache/conftool/dbconfig/20220915-120013-root.json [12:06:21] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1001.eqiad.wmnet with reason: host reimage [12:06:28] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10rook) https://github.com/toolforge/paws/pull/199 is approved and ready to go. Let me know when you are... 
[12:07:21] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1059 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:10:02] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1001.eqiad.wmnet with reason: host reimage [12:11:11] (03CR) 10Dreamy Jazz: "1.40.0-wmf.1 will be on all wikis in a few hours, so probably not a need to have this cherry-picked into the wmf.28 branch now." [core] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/831215 (https://phabricator.wikimedia.org/T317477) (owner: 10Jforrester) [12:11:27] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:33] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:17:42] !log fleet wide update of prometheus-rsyslog-exporter to 0.0.0+git20201008-4 - T289766 [12:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:46] T289766: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 [12:24:20] (03PS1) 10Hnowlan: install_server: enable unattended reimage for sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/832490 (https://phabricator.wikimedia.org/T303833) [12:25:46] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1001.eqiad.wmnet with OS buster [12:31:45] PROBLEM - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is CRITICAL: /sessions/v1/{key} (Store value for key) is CRITICAL: Test Store value for key returned the unexpected status 500 (expecting: 201) 
https://www.mediawiki.org/wiki/Kask [12:31:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/832490 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan) [12:32:09] (03CR) 10Elukey: [C: 03+1] install_server: enable unattended reimage for sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/832490 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan) [12:32:41] (03CR) 10Hnowlan: [C: 03+2] install_server: enable unattended reimage for sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/832490 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan) [12:34:03] RECOVERY - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [12:34:58] I definitely caused ^ but I am not sure why it would happen, looking [12:35:15] PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [12:37:16] Hello! Happy Thursday. I and a number of others are experiencing CSRF-related issues. [12:37:55] several can't log in. one can't load their watchlist. I kept having CSRF failures while trying to run a JWB task [12:37:59] (03CR) 10Volans: "question inline as I don't have enough context" [puppet] - 10https://gerrit.wikimedia.org/r/832482 (owner: 10Muehlenhoff) [12:39:20] Tamzin: thanks, I'm investigating this now. 
There's a problem with session storage [12:39:44] (03PS2) 10Muehlenhoff: docker: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/832252 (https://phabricator.wikimedia.org/T308013) [12:40:17] thanks [12:40:59] PROBLEM - MediaWiki edit session loss on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 [12:41:12] (03CR) 10Hashar: [C: 03+1] Enable profile::auto_restarts::service for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/832468 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:41:30] (03CR) 10Muehlenhoff: wikikube-etcd alias: Also include staging hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832482 (owner: 10Muehlenhoff) [12:46:01] PROBLEM - Check systemd state on sessionstore1001 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:20] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on sessionstore1001.eqiad.wmnet with reason: temporarily disabled due to sessionstore issues [12:46:44] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sessionstore1001.eqiad.wmnet with reason: temporarily disabled due to sessionstore issues [12:48:02] (a +1 from a Puppet Expert™ on what we're trying to do in https://gerrit.wikimedia.org/r/c/operations/puppet/+/831955 would be really appreciated prior to its scheduled deployment window, just in case we need to change anything!) 
[12:48:12] 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) [12:48:15] 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) 05duplicate→03Open Reopening to track the implementation/deployment of this fleetwide [12:48:16] Tamzin: I'm hoping this will have been fixed for the time being, could you confirm? [12:49:19] the CSRF issue I was having is nontrivial to replicate; lemme ping some of the other people who reported it on Discord [12:49:28] 1 "working fine" [12:50:06] 1 now able to log in [12:50:07] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:50:19] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:50:35] There was a surge in errors for sessions that has since abated [12:50:49] RECOVERY - MediaWiki centralauth errors on graphite1004 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [12:50:51] third vote of confidence. 
think we're good, yeah [12:51:47] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:52:14] Tamzin: thanks a lot for the report [12:52:25] RECOVERY - MediaWiki edit session loss on graphite1004 is OK: OK: Less than 30.00% above the threshold [10.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 [12:57:45] thanks for the fast response :) [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220915T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220915T1300). [13:00:05] sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] i can deploy today [13:00:24] hi sergi0_! 
[13:00:37] \o/ (I’ll have to leave before the end of the deployment window, so leaving things in urbanecm’s capable hands ^^) [13:00:46] :) [13:00:49] * urbanecm waves to Lucas_WMDE [13:01:00] * Lucas_WMDE waves back [13:01:09] hi [13:01:20] hey :) [13:01:40] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Jclark-ctr) Tracking for memory is today I will be on site all day and will take care of it when it arrives [13:01:55] so, test failed, but i'm going to hope it's just a temp issue [13:01:58] rebasing & +2'ing [13:02:11] (03CR) 10Urbanecm: [C: 03+2] Mentee overview: avoid requiring the non-vue mentee overview script when loading the Vue one [extensions/GrowthExperiments] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/832462 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:02:33] urbanecm: fingers crossed. [13:03:31] urbanecm: the patch does not fix a user facing issue but reduces scripts evaluate. Wouldn't be a big deal if we enable the dashboard without it. Plus wmf.1 would amend it in a few hours. [13:03:46] noted :). [13:03:55] *evaluated [13:05:09] (03PS1) 10JMeybohm: Prevent rsyslog-exporter from logging "error handling stats line" messages [puppet] - 10https://gerrit.wikimedia.org/r/832493 (https://phabricator.wikimedia.org/T289766) [13:05:59] 10SRE, 10Infrastructure-Foundations, 10Mail: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890 (10fgiunchedi) From alerting/observability's POV this will "fix" by itself as we progressively move away from Icinga for paging and into AM to handle all paging. Thus I'm untagging... 
[13:07:50] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37273/console" [puppet] - 10https://gerrit.wikimedia.org/r/832493 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [13:08:31] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:53] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:10:18] (03CR) 10Filippo Giunchedi: [C: 03+1] Prevent rsyslog-exporter from logging "error handling stats line" messages [puppet] - 10https://gerrit.wikimedia.org/r/832493 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [13:11:21] (03CR) 10JMeybohm: [C: 04-1] wikikube-etcd alias: Also include staging hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832482 (owner: 10Muehlenhoff) [13:11:44] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Prevent rsyslog-exporter from logging "error handling stats line" messages [puppet] - 10https://gerrit.wikimedia.org/r/832493 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [13:12:48] (03CR) 10Jelto: [V: 03+1] "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/832460 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto) [13:15:09] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Drop deprecated survey prefix (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832464 (https://phabricator.wikimedia.org/T317862) (owner: 10Awight) [13:17:07] (03PS2) 10Muehlenhoff: wikikube-etcd alias: Also include staging hosts [puppet] - 10https://gerrit.wikimedia.org/r/832482 [13:17:18] (03CR) 10Muehlenhoff: wikikube-etcd alias: Also include staging hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832482 (owner: 10Muehlenhoff) [13:22:49] just one more test to finish... [13:23:21] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/832468 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:23:27] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/832468 (https://phabricator.wikimedia.org/T135991) [13:26:57] (03Merged) 10jenkins-bot: Mentee overview: avoid requiring the non-vue mentee overview script when loading the Vue one [extensions/GrowthExperiments] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/832462 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:27:09] finally [13:27:45] sergi0_: your patch is at mwdebug1001, can you test please? [13:27:59] (not 100% sure it's testable; let me know if it's not) [13:28:33] I don't think we can test much until we enable, just no error logs [13:28:52] sergi0_: okay. so, good to sync from your PoV? [13:29:01] urbanecm: yep [13:29:04] doing [13:30:42] do we have any "fake mentor" for prod testing in any wiki? 
That could help me test in some cases, but I'm not sure if it's canonical [13:32:09] sergi0_: we have `Mentor dashboard usability test` which we used in the mentor dashboard usability test [13:32:25] it's testwiki though, and it has very few active mentees [13:32:58] 10SRE, 10Privacy Engineering, 10WMF-Legal, 10Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104 (10Krinkle) 05Open→03Resolved a:03Krinkle I do consider it a net-negative in quality of outcome, cost in both time and finance, and in violati... [13:33:27] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [13:33:27] sergi0_: you're also free to make use of the "manual" list and become ie. an enwiki mentor for testing purposes (https://en.wikipedia.org/wiki/Wikipedia:Growth_Team_features/Mentor_list/Manual). me, Marshall and Elena use that for actual production testing. [13:33:46] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.28/extensions/GrowthExperiments/: f592e85858d17a2de99cde93627054ee4972c2bd: Mentee overview: avoid requiring the non-vue mentee overview script when loading the Vue one (T300532) (duration: 04m 05s) [13:33:50] T300532: Migration of mentee overview to Vue - https://phabricator.wikimedia.org/T300532 [13:33:58] !log restarting pdns-recursor on A:dns-rec for zlib update [13:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:02] sergi0_: anyway, patch's live. anything else to deploy? [13:34:08] urbanecm: oh I see, might be very helpful in some cases, thank you! [13:34:15] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:48] sergi0_: any time :). 
Not sure if the manual list will let you through (it's protected), but if not, lmk and i can add you there. [13:34:54] inflatador: ^^ elastic2043 [13:35:00] urbanecm: patch's live means config patch? [13:35:05] the backport [13:35:12] is there supposed to be a config patch? [13:35:23] oh [13:35:28] i missed it at the calendar [13:35:29] sorry, by enable I was meaning merging https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/830197, it's after on the backport window [13:35:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:35:45] (03CR) 10Urbanecm: [C: 03+2] Enable the Vue version of the mentee overview in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830197 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:35:59] yep yep, the newline in the commit message confused me. [13:36:00] on it [13:36:06] gehel ACK, I'm getting highlights on 'elastic' now, and working on it [13:36:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:36:31] (03Merged) 10jenkins-bot: Enable the Vue version of the mentee overview in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830197 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:36:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:36:35] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:11] (03PS3) 10Andrew Bogott: toolviews.py: add daily prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/832355 [13:37:20] sergi0_: pulled to mwdebug1001, can you test? 
[13:37:25] (the config patch) [13:37:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:37:36] on it [13:38:13] note eswiki doesn't have mentorship enabled yet, so it won't work there (only after they enable it, which should happen fairly soon) [13:38:16] !log restarting bird.service on A:dns-rec for zlib update [13:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:42] ^ expected, should resolve [13:41:08] !log aqu@deploy1002 Started deploy [analytics/refinery@278c383] (hadoop-test): Regular analytics weekly train TEST (second try after freeing up some disk space) [analytics/refinery@278c383] [13:42:11] urbanecm: Can you add SGimeno_(WMF) in https://en.wikipedia.org/wiki/Wikipedia:Growth_Team_features/Mentor_list/Manual? [13:42:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:43:15] sergi0_: certainly: https://en.wikipedia.org/w/index.php?title=Wikipedia:Growth_Team_features/Mentor_list/Manual&diff=1110440053&oldid=1104771671 [13:43:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:43:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:45:50] urbanecm: I guess we need to run the maintenance script to see the change? 
[13:46:21] PROBLEM - cassandra-a CQL 10.64.0.144:9042 on sessionstore1001 is CRITICAL: connect to address 10.64.0.144 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:46:48] sergi0_: AFAICS it already recognizes you as a mentor [13:46:51] PROBLEM - cassandra-a SSL 10.64.0.144:7001 on sessionstore1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:46:53] https://www.irccloud.com/pastebin/rD4Sm3B8/ [13:47:09] !log aqu@deploy1002 Finished deploy [analytics/refinery@278c383] (hadoop-test): Regular analytics weekly train TEST (second try after freeing up some disk space) [analytics/refinery@278c383] (duration: 06m 01s) [13:47:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:47:17] PROBLEM - cassandra-a service on sessionstore1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:47:32] i ran `>>> \MediaWiki\MediaWikiServices::getInstance()->get('GrowthExperimentsMentorProvider')->invalidateCache()` just in case [13:48:34] (if you claimed some users, yes, maint script, or call `action=growthmentordashboardupdatedata` via the API) [13:50:26] urbanecm: meh, still the test needs to happen in a pilot wiki... 
[13:50:32] ah [13:50:58] !log updated rsyslog to 8.2208.0-1~bpo11+1 on all kubernetes masters and nodes - T289766 [13:51:01] sergi0_: added you as a cswiki mentor [13:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:02] T289766: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 [13:53:10] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10rook) [13:54:46] urbanecm: much appreciated. I can see the vue dashboard enabled! 1 more min of test [13:54:51] sure! [13:56:20] looks good to me [13:56:35] great! syncing [13:56:53] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@b9be20d]: Regular analytics weekly train TEST [airflow-dags@b9be20d] [13:57:03] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@b9be20d]: Regular analytics weekly train TEST [airflow-dags@b9be20d] (duration: 00m 10s) [13:57:21] !log retarting haproxy.service on A:dns-auth for zlib update [13:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:11] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@b9be20d]: Regular analytics weekly train [airflow-dags@b9be20d] [13:58:21] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@b9be20d]: Regular analytics weekly train [airflow-dags@b9be20d] (duration: 00m 09s) [14:00:40] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 6b9784a0708cf1e7762034ccfba7e5604b2f6dc2: Enable the Vue version of the mentee overview in pilot wikis (T300532) (duration: 03m 45s) [14:00:44] T300532: Migration of mentee overview to Vue - https://phabricator.wikimedia.org/T300532 [14:00:48] sergi0_: and, we're live! [14:00:50] anything else? 
[14:00:55] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:01:02] ^ expected, should resolve [14:01:09] !log retarting bird.service on A:dns-auth for zlib update [14:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:27] that would be all. Thank you for all the support and hints! [14:01:36] sukhe: if you make it a cookbook it could downtime that for you ;) [14:01:39] happy to help! and thanks for working on the vue version of the dashboard :) [14:02:39] volans: yeah, this restart cycle has made me realize that we need cookbooks for some of these things for sure [14:03:00] I don't think the BGP error will or should go away because of the nature of the anycast setup, but other things can use cookbooks [14:03:11] also beats typing things manually :P [14:03:42] :) [14:03:46] * sukhe adds to five-year plan [14:04:55] we've some good abstraction for that so it can be quite easy to write one, I'll do a lightning talk next week about it too (and an open discussion too) [14:05:25] thanks! yeah, for dns-auth, dns-rec, doh, it can be a good start [14:05:35] a lot of service overlap there: bird, anycast, pdns-rec, etc. [14:05:45] I will watch the talk [14:13:29] 10SRE, 10SRE Observability, 10Traffic: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I believe nowadays this is fixed, in the sense that the alert requires a minimum rps before firing. I'm tentati... 
[14:18:32] (03CR) 10BCornwall: [C: 03+2] varnish/tests: Remove extraneous test checks [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [14:26:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T314041)', diff saved to https://phabricator.wikimedia.org/P34789 and previous config saved to /var/cache/conftool/dbconfig/20220915-142612-ladsgroup.json [14:26:16] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [14:26:18] (03PS1) 10DCausse: Revert "cirrus: Handle transition to elasticsearch 7.10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832323 (https://phabricator.wikimedia.org/T308676) [14:32:34] 10SRE, 10Infrastructure-Foundations, 10LDAP: Add slapd audit logs to backup - https://phabricator.wikimedia.org/T317516 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This has been added to backups. [14:39:28] 10SRE, 10Privacy Engineering, 10WMF-Legal, 10Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104 (10Dzahn) I completely agree with the premise that this was a net-negative and that it affects our credibility. I am afraid this feedback won't reac... 
[14:41:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P34790 and previous config saved to /var/cache/conftool/dbconfig/20220915-144118-ladsgroup.json [14:41:50] !log installing libtirpc security updates [14:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:29] (03PS1) 10Herron: dispatch-be1001: apply role::insetup [puppet] - 10https://gerrit.wikimedia.org/r/832505 (https://phabricator.wikimedia.org/T313229) [14:45:40] (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: remove Icinga monitoring for phd supervising processes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832368 (https://phabricator.wikimedia.org/T315962) (owner: 10Dzahn) [14:53:33] (03CR) 10Andrew Bogott: [C: 03+2] toolviews.py: add daily prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/832355 (owner: 10Andrew Bogott) [14:56:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P34791 and previous config saved to /var/cache/conftool/dbconfig/20220915-145625-ladsgroup.json [14:59:16] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:59:20] (03CR) 10Herron: [C: 03+2] "standard vm setup, self-merging" [puppet] - 10https://gerrit.wikimedia.org/r/832505 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [14:59:42] (03CR) 10JMeybohm: Add cookbook to restart/reboot the Docker registry (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/832241 (owner: 10Muehlenhoff) [15:00:54] (03CR) 10Volans: "missed one thing earlier" [cookbooks] - 10https://gerrit.wikimedia.org/r/832241 (owner: 10Muehlenhoff) [15:02:42] (03PS2) 10BCornwall: Prometheus: Remove ATS gauge periods [puppet] - 10https://gerrit.wikimedia.org/r/832327 
(https://phabricator.wikimedia.org/T292815) [15:04:32] (03CR) 10JMeybohm: [C: 03+1] wikikube-etcd alias: Also include staging hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832482 (owner: 10Muehlenhoff) [15:05:07] (03CR) 10Muehlenhoff: [C: 03+2] wikikube-etcd alias: Also include staging hosts [puppet] - 10https://gerrit.wikimedia.org/r/832482 (owner: 10Muehlenhoff) [15:11:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T314041)', diff saved to https://phabricator.wikimedia.org/P34792 and previous config saved to /var/cache/conftool/dbconfig/20220915-151131-ladsgroup.json [15:11:36] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [15:18:05] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:rack/setup/install X - https://phabricator.wikimedia.org/T317892 (10RobH) [15:18:15] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:rack/setup/install X - https://phabricator.wikimedia.org/T317892 (10RobH) [15:18:55] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore,name=eqiad [15:22:14] RECOVERY - Check systemd state on sessionstore1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:16] RECOVERY - cassandra-a SSL 10.64.0.144:7001 on sessionstore1001 is OK: SSL OK - Certificate sessionstore1001-a valid until 2023-02-22 11:12:05 +0000 (expires in 159 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:22:25] 10SRE, 10ops-codfw, 10DC-Ops: Q2:rack/setup/install puppetdb2003 - https://phabricator.wikimedia.org/T317894 (10RobH) [15:22:39] 10SRE, 10ops-codfw, 10DC-Ops: Q2:rack/setup/install puppetdb2003 - https://phabricator.wikimedia.org/T317894 (10RobH) [15:22:56] !log starting cassandra on sessionstore1001-a [15:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:21] (03CR) 10Cwhite: "Please excuse me if 
the questions seem ignorant. I know less about what cookbooks are capable of than I ought to." [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [15:23:40] RECOVERY - cassandra-a service on sessionstore1001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:23:44] 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10RobH) [15:24:08] RECOVERY - cassandra-a CQL 10.64.0.144:9042 on sessionstore1001 is OK: TCP OK - 0.000 second response time on 10.64.0.144 port 9042 https://phabricator.wikimedia.org/T93886 [15:27:14] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: sync [15:27:32] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: sync [15:28:35] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=eqiad [15:34:50] (03CR) 10Dzahn: [C: 03+2] planet: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/832466 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:37:10] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [15:37:51] (03CR) 10Dzahn: mediawiki::api: fix kernel parameter name ip_local_port_range (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831629 (https://phabricator.wikimedia.org/T317454) (owner: 10Dzahn) [15:39:51] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:06] (03CR) 10Dzahn: [C: 03+1] "looks good to me, one nitpick inside about data types" [puppet] - 10https://gerrit.wikimedia.org/r/832460 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto) [15:44:12] (03CR) 10Jelto: [V: 03+1] buildkitd: add option to enable proxy settings for buildkitd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832460 (https://phabricator.wikimedia.org/T308271) (owner: 10Jelto) [15:45:53] (03CR) 10Volans: [C: 
+1] "replies inline" [cookbooks] - https://gerrit.wikimedia.org/r/832447 (owner: Muehlenhoff) [15:45:56] SRE-OnFire, serviceops, Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (Clement_Goubert) Logs at the moment of the incident point to an etcd.Client uncaught exception. ` Sep 8 15:17:23 conf2005 etcdmirror-conftool-eqia... [15:46:05] (CR) Dzahn: [C: +1] buildkitd: add option to enable proxy settings for buildkitd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832460 (https://phabricator.wikimedia.org/T308271) (owner: Jelto) [15:50:59] (PS3) Samtar: rewrite.py: changes for Phonos deployment [puppet] - https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: MusikAnimal) [15:55:39] (CR) Herron: [C: +1] "This will be great thanks for putting it together!" [cookbooks] - https://gerrit.wikimedia.org/r/832447 (owner: Muehlenhoff) [16:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220915T1600). [16:00:05] musikanimal: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:14] o/ [16:00:44] musikanimal: hey! looking, one sec [16:01:25] okay, heads up: I didn't really have any way of testing this patch. However, what we're trying to do is basically exactly the same thing as Extension:Score, so I just copied that code and changed it accordingly [16:02:20] I'm hoping you can tell me if this does what we need. I.e.
this link should work after the patch is merged: https://upload.wikimedia.beta.wmflabs.org/phonos/0/h/0hp7eif2wwbuhif94n42bzm95o71z9i.mp3 [16:02:48] the file should be there, it's just not exposed due to the Swift rewrite rules [16:03:27] musikanimal: hm, okay -- I think I'd like to help find you a reviewer more familiar with swift, rather than stamp this as part of the puppet request window [16:03:55] I'm not confident enough in the subject matter to be comfortable being the only person reviewing :) sorry for the added delay, let me see if I can get this looked at soonish [16:03:56] oh, I may have wrongly assumed the puppet deployers would know about these things :) sorry! [16:04:35] no, not a bad guess! I'm happy you gave it a try, just sorry I can't be more help :) [16:04:37] this is a shame though, because QA can't test our extension right now because it isn't usable on Beta [16:04:51] ah understood [16:05:35] I only figured out what was wrong by circumstance (code search for "Regexp failed to match URI") [16:05:35] godog: you don't happen to still be online, do you? [16:06:47] also pinging legoktm since you seemed to know something about Swift. The patch we're discussing is here and we need someone knowledgeable on Swift to review it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/831955/ [16:07:34] lego is also familiar with what we're trying to do with Phonos [16:07:37] (PS1) Andrew Bogott: Temporarily remove bd808's keys [puppet] - https://gerrit.wikimedia.org/r/832513 [16:07:39] sorry, at work [16:07:46] I can look in the evening [16:07:52] no problem, thank you :) [16:08:42] don't worry about rushing to find someone to review this, rzl. If we have to wait until Tuesday, we'll make do!
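For context on the rewrite-rule discussion above, here is a minimal Python sketch of the kind of URL matching being described. It is not the actual rewrite.py from the puppet repository: the pattern, the function name `match_phonos_url`, and the captured group names are all hypothetical, chosen only to fit the shape of the beta URL quoted in the conversation (`/phonos/<char>/<char>/<name>.mp3`).

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern, NOT taken from rewrite.py: two single-character
# shard directories followed by an .mp3 file name, all under /phonos/.
PHONOS_PATH = re.compile(
    r"^/phonos/(?P<shard1>[0-9a-z])/(?P<shard2>[0-9a-z])/(?P<name>[^/]+\.mp3)$"
)


def match_phonos_url(url):
    """Return the captured path parts as a dict, or None if nothing matches."""
    match = PHONOS_PATH.match(urlparse(url).path)
    if match is None:
        # Loosely analogous to the "Regexp failed to match URI" case
        # mentioned in the conversation.
        return None
    return match.groupdict()


if __name__ == "__main__":
    url = (
        "https://upload.wikimedia.beta.wmflabs.org"
        "/phonos/0/h/0hp7eif2wwbuhif94n42bzm95o71z9i.mp3"
    )
    print(match_phonos_url(url))
```

A URL whose path does not start with `/phonos/` returns `None` here, which is the situation the patch under review is trying to avoid for Phonos files.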
[16:10:05] (PS1) Andrew Bogott: Temporarily disable bd808's key [labs/private] - https://gerrit.wikimedia.org/r/832514 [16:10:25] (CR) Andrew Bogott: [C: +2] Temporarily remove bd808's keys [puppet] - https://gerrit.wikimedia.org/r/832513 (owner: Andrew Bogott) [16:10:46] musikanimal: okay, sounds good -- I'm asking around behind the scenes, will still get you unblocked if I can :) [16:11:09] alrighty, thank you!! :) [16:11:18] (CR) Andrew Bogott: [V: +2 C: +2] Temporarily disable bd808's key [labs/private] - https://gerrit.wikimedia.org/r/832514 (owner: Andrew Bogott) [16:13:50] (PS1) JMeybohm: Add missing dashboard links to k8s related alerts [alerts] - https://gerrit.wikimedia.org/r/832517 [16:16:03] (CR) CI reject: [V: -1] Add missing dashboard links to k8s related alerts [alerts] - https://gerrit.wikimedia.org/r/832517 (owner: JMeybohm) [16:16:24] !log andrew@cumin1001 START - Cookbook sre.idm.logout Logging BryanDavis out of all services on: 2047 hosts [16:17:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging BryanDavis out of all services on: 2047 hosts [16:17:55] (PS1) Hashar: Use gerrit-deploy for deployment on devtools [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/832518 (https://phabricator.wikimedia.org/T317412) [16:35:35] (CR) Jelto: [V: +1 C: +2] buildkitd: add option to enable proxy settings for buildkitd [puppet] - https://gerrit.wikimedia.org/r/832460 (https://phabricator.wikimedia.org/T308271) (owner: Jelto) [16:37:06] PROBLEM - Apache HTTP on mw2383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:08] PROBLEM - Apache HTTP on mw2270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:10] PROBLEM - Apache HTTP on mw2391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Application_servers [16:37:10] PROBLEM - Apache HTTP on mw2363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:16] PROBLEM - Apache HTTP on mw1436 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:18] PROBLEM - Apache HTTP on mw2361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:18] PROBLEM - Apache HTTP on mw2371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:37:20] PROBLEM - Apache HTTP on mw2333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:20] PROBLEM - Apache HTTP on mw2336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:20] PROBLEM - Apache HTTP on mw2392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:22] PROBLEM - Apache HTTP on mw2388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:22] PROBLEM - Apache HTTP on mw2409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:22] PROBLEM - Apache HTTP on mw2412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:24] PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [16:37:26] PROBLEM - Apache HTTP on mw2311 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:26] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:26] PROBLEM - Apache HTTP on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:30] PROBLEM - Apache HTTP on mw2309 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:30] PROBLEM - Apache HTTP on mw2307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:30] PROBLEM - Apache HTTP on mw2338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:30] PROBLEM - Apache HTTP on mw2359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:30] PROBLEM - Apache HTTP on mw2369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:36] PROBLEM - Apache HTTP on mw2274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:44] exactly two of those are eqiad, fascinating [16:37:46] PROBLEM - Apache HTTP on mw2375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:37:54] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe1001.eqiad.wmnet, thanos-fe1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:37:56] PROBLEM - Apache HTTP on mw2303 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:00] PROBLEM - Apache HTTP on mw2271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:00] PROBLEM - Apache HTTP on mw2275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:04] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2313.codfw.wmnet, mw2271.codfw.wmnet, mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2378.codfw.wmnet, mw2371.codfw.wmnet, mw2331.codfw.wmnet, mw2392.codfw.wmnet, mw2359.codfw.wmnet, mw2414.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2310.codfw.wmnet, mw2413.codfw.wmnet, mw2316.codfw.wmnet, mw2303.codfw.wmnet, mw [16:38:04] fw.wmnet, mw2393.codfw.wmnet, mw2314.codfw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2276.codfw.wmnet, mw2369.codfw.wmnet, mw2269.codfw.wmnet, mw2365.codfw.wmnet, mw2406.codfw.wmnet, mw2408.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2373.codfw.wmnet, mw2270.codfw.wmnet, mw2335.codfw.wmnet, mw2339.codfw.wmnet, mw2337.codfw.wmnet, mw2415.codfw.wmnet, mw2377.codfw.wmnet, mw2385.codfw.wmnet, mw2274.codfw [16:38:04] mw2277.codfw.wmnet, mw2384.codfw.wmnet, mw2305.codfw.wmnet, mw2388.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2379.codfw.wmnet, mw2389.codfw.wmnet, mw2383.codfw.wmnet, mw230 https://wikitech.wikimedia.org/wiki/PyBal [16:38:04] PROBLEM - Apache HTTP on mw2314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:08] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:38:09] Hey, any error ongoing? Every wiki page says "upstream connect error or disconnect/reset before headers.
reset reason: overflow" [16:38:10] PROBLEM - Apache HTTP on mw2310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:12] PROBLEM - Apache HTTP on mw2335 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:12] PROBLEM - Apache HTTP on mw2301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:12] PROBLEM - Apache HTTP on mw2316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:14] PROBLEM - Apache HTTP on mw2379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:14] PROBLEM - Apache HTTP on mw2407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:14] PROBLEM - Apache HTTP on mw2406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:17] (PHPFPMTooBusy) firing: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:38:18] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2313.codfw.wmnet, mw2271.codfw.wmnet, mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2392.codfw.wmnet, mw2371.codfw.wmnet, mw2415.codfw.wmnet, mw2393.codfw.wmnet, mw2312.codfw.wmnet, mw2310.codfw.wmnet, mw2413.codfw.wmnet, mw2316.codfw.wmnet, mw2329.codfw.wmnet, mw2325.codfw.wmnet, mw2379.codfw.wmnet, mw2275.codfw.wmnet, mw [16:38:18] fw.wmnet, mw2369.codfw.wmnet, 
mw2269.codfw.wmnet, mw2365.codfw.wmnet, mw2406.codfw.wmnet, mw2408.codfw.wmnet, mw2315.codfw.wmnet, mw2373.codfw.wmnet, mw2270.codfw.wmnet, mw2335.codfw.wmnet, mw2339.codfw.wmnet, mw2337.codfw.wmnet, mw2377.codfw.wmnet, mw2385.codfw.wmnet, mw2331.codfw.wmnet, mw2277.codfw.wmnet, mw2384.codfw.wmnet, mw2305.codfw.wmnet, mw2388.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2380.codfw.wmnet, mw2268.codfw [16:38:18] mw2273.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2336.codfw.wmnet, mw2303.codfw.wmnet, mw2391.codfw.wmnet, mw2387.codfw.wmnet, mw2407.codfw.wmnet, mw2359.codfw.wmnet, mw236 https://wikitech.wikimedia.org/wiki/PyBal [16:38:19] (ProbeDown) firing: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:24] RECOVERY - Apache HTTP on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:38:30] RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:38:34] RECOVERY - Apache HTTP on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:38:34] RECOVERY - Apache HTTP on mw1393 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:38:35] MdsShakil: yep, looking, stand by [16:38:42] PROBLEM - Thanos swift https on thanos-fe2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Thanos [16:38:44] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 1979 gt 
100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:39:04] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method= [16:39:24] PROBLEM - PHP7 rendering on mw2385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:27] PROBLEM - PHP7 rendering on mw2274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:27] PROBLEM - PHP7 rendering on mw2331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:27] PROBLEM - PHP7 rendering on mw2361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:28] PROBLEM - PHP7 rendering on mw2384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:30] PROBLEM - PHP7 rendering on mw2309 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:30] PROBLEM - PHP7 rendering on mw2412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:32] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:39:36] PROBLEM - Thanos swift https on thanos-fe1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.047 second response time https://wikitech.wikimedia.org/wiki/Thanos [16:39:38] PROBLEM - PHP7 rendering on mw2388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:38] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [16:39:40] PROBLEM - PHP7 rendering on mw2312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:40] PROBLEM - PHP7 rendering on mw2310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:40] PROBLEM - PHP7 rendering on mw2363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:40] PROBLEM - PHP7 rendering on mw2373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:40] PROBLEM - PHP7 rendering on mw2393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:40] PROBLEM - Apache HTTP on mw2327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:44] PROBLEM - Apache HTTP on mw2378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:44] 
PROBLEM - PHP7 rendering on mw2386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:44] PROBLEM - Maps HTTPS on maps1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:39:52] PROBLEM - Maps HTTPS on maps1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:39:52] PROBLEM - Maps HTTPS on maps1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:39:52] PROBLEM - PHP7 rendering on mw2268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:53] rzl: switch down? [16:40:00] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:40:12] PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:40:20] PROBLEM - PHP7 rendering on mw2273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:22] PROBLEM - Thanos swift https on thanos-fe1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Thanos [16:40:22] PROBLEM - Thanos swift https on thanos-fe1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Thanos [16:40:22] PROBLEM - PHP7 rendering on mw2339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:22] PROBLEM - PHP7 rendering on mw2337 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:24] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:40:32] PROBLEM - PHP7 rendering on mw2338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:32] PROBLEM - PHP7 rendering on mw2316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:32] RECOVERY - Apache HTTP on mw2271 is OK: HTTP OK: HTTP/1.1 302 Found - 504 bytes in 2.863 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:34] PROBLEM - PHP7 rendering on mw2389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:34] RECOVERY - Apache HTTP on mw2314 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:36] RECOVERY - Apache HTTP on mw2275 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.580 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:36] RECOVERY - PHP7 rendering on mw2385 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:38] RECOVERY - PHP7 rendering on mw2331 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:38] 
RECOVERY - PHP7 rendering on mw2361 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:38] RECOVERY - PHP7 rendering on mw2274 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:40] RECOVERY - Apache HTTP on mw2310 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:40] RECOVERY - PHP7 rendering on mw2384 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:42] RECOVERY - PHP7 rendering on mw2309 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:42] RECOVERY - PHP7 rendering on mw2412 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:44] RECOVERY - Apache HTTP on mw2335 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:44] RECOVERY - Apache HTTP on mw2316 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:44] RECOVERY - Apache HTTP on mw2301 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:48] RECOVERY - Apache HTTP on mw2406 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:48] RECOVERY - Apache HTTP on mw2379 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.094 
second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:48] RECOVERY - Apache HTTP on mw2407 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:52] RECOVERY - PHP7 rendering on mw2388 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:52] RECOVERY - PHP7 rendering on mw2312 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:52] RECOVERY - PHP7 rendering on mw2310 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:52] RECOVERY - PHP7 rendering on mw2363 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:52] RECOVERY - PHP7 rendering on mw2393 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:52] RECOVERY - PHP7 rendering on mw2373 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:54] RECOVERY - Apache HTTP on mw2327 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:56] RECOVERY - Apache HTTP on mw2378 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:57] RECOVERY - PHP7 rendering on mw2386 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.089 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:57] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [16:41:00] RECOVERY - Apache HTTP on mw2383 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:02] RECOVERY - Apache HTTP on mw2270 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:04] RECOVERY - Apache HTTP on mw2391 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:04] RECOVERY - Apache HTTP on mw2363 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:04] RECOVERY - PHP7 rendering on mw2268 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:41:12] RECOVERY - Apache HTTP on mw2361 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:12] RECOVERY - Apache HTTP on mw2371 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:16] RECOVERY - Apache HTTP on mw2392 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:16] RECOVERY - Apache HTTP on mw2333 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:16] RECOVERY - Apache HTTP on mw2336 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.092 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [16:41:18] RECOVERY - Apache HTTP on mw2388 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:18] RECOVERY - Apache HTTP on mw2409 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:18] RECOVERY - Apache HTTP on mw2412 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:18] RECOVERY - Apache HTTP on mw2311 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:24] RECOVERY - Apache HTTP on mw2309 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:24] RECOVERY - Apache HTTP on mw2307 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:24] RECOVERY - Apache HTTP on mw2338 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:26] RECOVERY - Apache HTTP on mw2359 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:26] RECOVERY - Apache HTTP on mw2369 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:32] RECOVERY - Apache HTTP on mw2274 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:44] RECOVERY - PHP7 rendering on mw2273 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.105 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:41:47] RECOVERY - Apache HTTP on mw2375 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:41:47] RECOVERY - PHP7 rendering on mw2339 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:41:47] RECOVERY - PHP7 rendering on mw2337 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:41:56] RECOVERY - PHP7 rendering on mw2316 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:41:56] RECOVERY - PHP7 rendering on mw2338 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:42:00] RECOVERY - Apache HTTP on mw2303 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:42:00] RECOVERY - PHP7 rendering on mw2389 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:42:18] (ProbeDown) firing: (24) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:43:08] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 10 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:43:17] (PHPFPMTooBusy) resolved: (2) Not enough idle php7.2-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:43:18] (ProbeDown) firing: (24) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:43:28] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) @BCornwall: part of the kerberos credential setup used to be that an email was sent to Hannah with instructions on how to reset a temporary password (https://wikitec... [16:44:01] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10KFrancis) @HasanAkgun_WMDE Thank you! I'm just waiting for legal counsel to sign on my end. I'll let you all know when it's complete. [16:44:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [16:45:12] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:45:14] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:45:24] PROBLEM - Thanos swift https on thanos-fe2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Thanos [16:46:39] (03PS1) 10DDesouza: Increase coverage of Research Incentive Survey on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832526 (https://phabricator.wikimedia.org/T316466) [16:47:18] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Dzahn) I don't think there is a definition what "private" means without going into detail. It can mean "users can only subscribe with approval", it can mean "archiv... [16:47:24] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:46] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:48:45] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:48:57] (ThanosCompactIsDown) firing: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [16:49:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [16:50:20] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:52:56] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:53:06] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:16] (03CR) 10Filippo Giunchedi: "[I've been asked for a review, giving this a quick look]" [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [16:56:36] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) Hi, @Milimetric. No need for a new email, I consider that to be part of the same issue. I'm assuming Hannah checked their spam folder? :) @BTullis when you kindly se... 
[16:58:16] PROBLEM - Maps HTTPS on maps1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:58:34] PROBLEM - Thanos swift https on thanos-fe2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Thanos [17:00:05] bd808: Your horoscope predicts another unfortunate Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220915T1700). [17:05:46] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-swift_443: Servers thanos-fe1001.eqiad.wmnet, thanos-fe1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:06:16] (ThanosSidecarBucketOperationsFailed) firing: (14) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [17:08:15] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BTullis) >>! In T317545#8240048, @BCornwall wrote: > Hi, @Milimetric. No need for a new email, I consider that to be part of the same issue. I'm assuming Hannah checked their sp... [17:09:55] dancy, jeena: are either of you available to do a backport deploy before the train? 
[17:10:02] PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-store.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:08] cscott: yes [17:10:25] we have a fix for T317857, not sure how to schedule that since we don't have another backport window before the train deploy [17:10:26] T317857: Table of Contents not displayed in all skins except Vector 2022 when DiscussionTools is enabled on the page - https://phabricator.wikimedia.org/T317857 [17:10:30] can you hold off a moment? we're still trying to root-cause that last outage, want to make sure everything is stable [17:11:01] (03PS1) 10C. Scott Ananian: Use more permissive match for TOC_PLACEHOLDER in parser output [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/832547 (https://phabricator.wikimedia.org/T317857) [17:11:09] cscott: If you're confident in the fix I can deploy it at the beginning of the next train window (~50 minutes) [17:11:16] (ThanosSidecarBucketOperationsFailed) firing: (36) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [17:11:24] yeah, i'm pretty confident. we tested it both locally and on beta. [17:11:42] dancy: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/832547 is the backport to 1.40.0-wmf.1 [17:11:47] 👍🏾 I'll take care of it then. Feel free to +2 it in the meantime. [17:12:22] (03CR) 10C. Scott Ananian: [C: 03+2] "C+2ed w/ permission of dancy, who will deploy at the start of the next train deploy window." [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/832547 (https://phabricator.wikimedia.org/T317857) (owner: 10C. Scott Ananian) [17:12:30] ok, done. [17:13:12] i'll be online during the train window. 
[17:14:09] (03CR) 10MusikAnimal: rewrite.py: changes for Phonos deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [17:14:58] Great. See you then [17:15:13] rzl: Please ping me when you're satisfied. [17:15:26] dancy: will do, thanks for your patience [17:30:52] (03Merged) 10jenkins-bot: Use more permissive match for TOC_PLACEHOLDER in parser output [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/832547 (https://phabricator.wikimedia.org/T317857) (owner: 10C. Scott Ananian) [17:35:52] thanks dancy!👍 [17:38:26] dancy: we've narrowed the only ongoing problems to swift, so a MW deploy shouldn't be directly affected -- but since thanos is one of the affected components, our monitoring isn't at 100% and I'd still prefer to delay anything non-urgent -- will keep you updated [17:39:06] Ok [17:39:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:39:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:44:50] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:58] (03PS26) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) [17:50:15] (03PS2) 10Jbond: C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines [puppet] - 10https://gerrit.wikimedia.org/r/832268 (https://phabricator.wikimedia.org/T317799) [17:50:55] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Hokwelum) I'm sorry @Milimetric, @BCornwall, and @BTullis, I didn’t check my spam earlier but I just saw the mail now :-) [17:52:00] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL 
- degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:54:22] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:00] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Masssly) That makes sense. We’ll change the settings according to our needs. To clarify what we'll be doing, users can only subscribe with approval, and the archiv... [18:00:05] dancy and jeena: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220915T1800). Please do the needful. [18:01:36] Holding the train until rzl reports readiness. 
[18:03:35] (03PS1) 10Filippo Giunchedi: thanos: have envoy listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/832532 [18:03:53] rzl cwhite ^ [18:03:56] looking [18:04:17] (03CR) 10Cwhite: [C: 03+1] thanos: have envoy listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/832532 (owner: 10Filippo Giunchedi) [18:04:39] (03CR) 10RLazarus: [C: 03+1] thanos: have envoy listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/832532 (owner: 10Filippo Giunchedi) [18:04:43] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: have envoy listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/832532 (owner: 10Filippo Giunchedi) [18:05:21] dancy: status is godog identified the root cause and we expect that puppet patch to be the fix, you should be good to go as soon as we get it validated and rolled out [18:05:32] thanks folks, running puppet etc [18:06:08] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache thanos-fe2002.codfw.wmnet on all recursors [18:06:10] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) thanos-fe2002.codfw.wmnet on all recursors [18:06:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:07:20] !log restart envoyproxy on thanos-fe* [18:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:10] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10Milimetric) Ok, cool. After kinit-ing successfully I noticed Hannah is now in analytics-admin but not analytics-privatedata-users. She needs both of those the way it's current... 
[18:13:09] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [18:14:09] 10SRE, 10SRE-swift-storage: Get swift (and its components) ready for v6 - https://phabricator.wikimedia.org/T317909 (10fgiunchedi) [18:15:10] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:15:50] !log depool wcqs2001 for T316236 [18:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:53] T316236: Reload WCQS from dumps - https://phabricator.wikimedia.org/T316236 [18:16:09] 10SRE, 10SRE-swift-storage: Get swift (and its components) ready for v6 - https://phabricator.wikimedia.org/T317909 (10fgiunchedi) cc'ing {T271138} for tracking purposes [18:16:37] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache thanos-fe1001.eqiad.wmnet on all recursors [18:16:41] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) thanos-fe1001.eqiad.wmnet on all recursors [18:16:42] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache thanos-fe1002.eqiad.wmnet on all recursors [18:16:45] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) thanos-fe1002.eqiad.wmnet on all recursors [18:16:46] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache thanos-fe1003.eqiad.wmnet on all recursors [18:16:49] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) thanos-fe1003.eqiad.wmnet on all recursors [18:16:58] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache thanos-fe2001.codfw.wmnet on all recursors [18:17:00] RECOVERY - Thanos swift https on thanos-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Thanos [18:17:01] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) thanos-fe2001.codfw.wmnet on all recursors [18:17:02] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache thanos-fe2002.codfw.wmnet on all recursors [18:17:05] !log cwhite@cumin2002 END (PASS) - 
Cookbook sre.dns.wipe-cache (exit_code=0) thanos-fe2002.codfw.wmnet on all recursors [18:17:06] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache thanos-fe2003.codfw.wmnet on all recursors [18:17:09] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) thanos-fe2003.codfw.wmnet on all recursors [18:17:18] (ProbeDown) firing: (2) Service thanos-swift:443 has failed probes (http_thanos-swift_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-swift:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:17:44] RECOVERY - Thanos swift https on thanos-fe2003 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Thanos [18:17:54] RECOVERY - Check systemd state on thanos-fe2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:10] RECOVERY - Thanos swift https on thanos-fe2002 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Thanos [18:18:14] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:18:18] RECOVERY - Thanos swift https on thanos-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Thanos [18:18:18] RECOVERY - Thanos swift https on thanos-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Thanos [18:18:18] (ProbeDown) resolved: (2) Service thanos-swift:443 has failed probes (http_thanos-swift_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-swift:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:18:24] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:18:31] nice! [18:18:46] that transient ProbeDown is interesting, must have been something about the switch back [18:18:46] RECOVERY - Thanos swift https on thanos-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Thanos [18:18:59] indeed [18:19:13] dashboards are starting to look recovered also [18:19:22] dancy: you can expect an all-clear in about five minutes or so, assuming no surprises [18:19:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [18:19:32] 👍🏾 [18:19:40] RECOVERY - Maps HTTPS on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 9.624 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:21:16] (ThanosSidecarBucketOperationsFailed) firing: (36) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [18:21:31] (ThanosSidecarBucketOperationsFailed) firing: (36) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [18:21:36] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:21:42] hmm, tegola graphs are recovering but 
varnish failed upload fetches are still elevated [18:22:18] (ProbeDown) resolved: (2) Service thanos-swift:443 has failed probes (http_thanos-swift_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-swift:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:23:26] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:23:34] (03PS1) 10BCornwall: admin: Add hokwelum to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/832534 (https://phabricator.wikimedia.org/T317545) [18:24:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [18:25:12] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 6.659 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:26:06] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 5.059 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:26:16] (ThanosSidecarBucketOperationsFailed) resolved: (36) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [18:26:32] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 1.945 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:26:34] RECOVERY - Maps HTTPS on maps1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 4.586 second response time 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:26:36] RECOVERY - Maps HTTPS on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:26:50] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 4.082 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:27:16] RECOVERY - Maps HTTPS on maps1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.674 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:27:18] RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 2.723 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:27:18] dancy: okay, fire away :) thanks again for waiting [18:27:35] No problem. Pressing the button now! [18:27:48] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.545 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:27:52] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 4.220 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:28:32] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 8.685 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:29:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/832547 (https://phabricator.wikimedia.org/T317857) (owner: 10C. 
Scott Ananian) [18:29:14] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:29:24] !log dancy@deploy1002 Started scap: Backport for [[gerrit:832547|Use more permissive match for TOC_PLACEHOLDER in parser output (T317857)]] [18:29:28] T317857: Table of Contents not displayed in all skins except Vector 2022 when DiscussionTools is enabled on the page - https://phabricator.wikimedia.org/T317857 [18:29:50] !log dancy@deploy1002 dancy and cscott: Backport for [[gerrit:832547|Use more permissive match for TOC_PLACEHOLDER in parser output (T317857)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [18:33:07] Jdlrobson: Ready for testing [18:34:43] I tested using the instructions in the ticket. A/B behavior confirmed. [18:35:00] Proceeding [18:37:18] (03PS1) 10Ebernhardson: [DNM] Compile puppet catalog for review [puppet] - 10https://gerrit.wikimedia.org/r/832535 [18:37:48] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:52] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:38:16] !log restart thanos-compact (thanos-fe2001) and swift_ring_manager (thanos-fe1001) [18:38:17] (03CR) 10Ahmon Dancy: profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: 10Samtar) [18:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:33] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37274/console" [puppet] - 
10https://gerrit.wikimedia.org/r/832535 (owner: 10Ebernhardson) [18:38:57] (ThanosCompactIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [18:39:18] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:832547|Use more permissive match for TOC_PLACEHOLDER in parser output (T317857)]] (duration: 09m 53s) [18:39:21] T317857: Table of Contents not displayed in all skins except Vector 2022 when DiscussionTools is enabled on the page - https://phabricator.wikimedia.org/T317857 [18:40:42] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:40:43] (03PS1) 10TrainBranchBot: group2 wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832536 (https://phabricator.wikimedia.org/T314190) [18:40:45] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832536 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [18:41:26] (03Merged) 10jenkins-bot: group2 wikis to 1.40.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832536 (https://phabricator.wikimedia.org/T314190) (owner: 10TrainBranchBot) [18:43:45] (JobUnavailable) resolved: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:45:34] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.40.0-wmf.1 refs T314190 [18:45:38] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [18:53:24] PROBLEM - Cxserver LVS eqiad on 
cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [18:58:00] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [18:58:26] (03CR) 10Hashar: spec_helper: include the monkey patch for the actual spec tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832469 (owner: 10Jbond) [18:58:43] (03PS2) 10Hashar: spec_helper: include the monkey patch for the actual spec tests [puppet] - 10https://gerrit.wikimedia.org/r/832469 (owner: 10Jbond) [19:03:30] dancy: yep lgtm [19:03:34] sorry for the delay! [19:03:55] No prob [19:05:47] dancy: LGTM as well, thanks so much! [19:08:08] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:25:23] (03PS1) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) [19:25:35] (03Abandoned) 10Ebernhardson: [DNM] Compile puppet catalog for review [puppet] - 10https://gerrit.wikimedia.org/r/832535 (owner: 10Ebernhardson) [19:26:29] !log pool'd wdqs2001, some blockers before reload can start T316236 [19:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:33] T316236: Reload WCQS from dumps - https://phabricator.wikimedia.org/T316236 [19:26:55] (03CR) 10CI reject: [V: 04-1] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [19:34:40] (03PS2) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 
(https://phabricator.wikimedia.org/T222349) [19:35:38] (03CR) 10CI reject: [V: 04-1] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [19:36:50] PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:38:54] (03PS3) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) [19:39:49] (03CR) 10CI reject: [V: 04-1] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [19:40:14] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10KFrancis) @BCornwall I am confirming the NDA has been signed. Please proceed with the access request. Thanks! [19:44:20] (03PS4) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) [19:45:13] (03CR) 10CI reject: [V: 04-1] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [19:45:28] (03CR) 10Hashar: [C: 03+1] "Looks like it is addressing my use case (running "rake spec" from a module directory). Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/832469 (owner: 10Jbond) [19:46:39] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (NOOP 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37278/console" [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [19:49:11] (03PS5) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) [19:50:03] (03CR) 10CI reject: [V: 04-1] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [19:51:51] (03PS6) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) [19:51:58] (03PS1) 10Jdlrobson: Update collapsed TOC menu width [skins/Vector] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/832551 (https://phabricator.wikimedia.org/T316056) [19:52:41] (03PS2) 10Jdlrobson: EXPECTED VISUAL CHANGES FOR 1.40.0-wmf.1 [skins/Vector] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/832551 (https://phabricator.wikimedia.org/T316056) [19:55:05] (03PS7) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) [19:58:35] (03CR) 10CI reject: [V: 04-1] Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [20:00:04] brennen and TheresNoTime: gettimeofday() says it's time for UTC late backport and config training. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220915T2000) [20:00:04] inflatador and danisztls: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:54] o/ I can be brennen today [20:02:34] inflatador: dani around? [20:03:01] (03PS8) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) [20:03:14] thcipriani confirmed [20:03:33] hey inflatador [20:03:49] o/ [20:04:15] hey danisztls [20:05:22] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37283/console" [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: 10Ebernhardson) [20:06:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832323 (https://phabricator.wikimedia.org/T308676) (owner: 10DCausse) [20:07:25] (03Merged) 10jenkins-bot: Revert "cirrus: Handle transition to elasticsearch 7.10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832323 (https://phabricator.wikimedia.org/T308676) (owner: 10DCausse) [20:07:40] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:832323|Revert "cirrus: Handle transition to elasticsearch 7.10" (T308676)]] [20:07:44] T308676: Elasticsearch 7.10.2 rollout plan - https://phabricator.wikimedia.org/T308676 [20:08:00] !log thcipriani@deploy1002 thcipriani and dcausse: Backport for [[gerrit:832323|Revert "cirrus: Handle transition to elasticsearch 7.10" (T308676)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:08:31] ^ inflatador your change is on mwdebug, check please! 
[20:08:35] (if possible) [20:10:19] thcipriani ACK, just checked and it looks good. Proceed at your convenience! [20:10:30] inflatador: thanks, doing now [20:13:20] (03PS1) 10BCornwall: admin: Add Hasan Akgün (haak) to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/832566 (https://phabricator.wikimedia.org/T317637) [20:15:19] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:832323|Revert "cirrus: Handle transition to elasticsearch 7.10" (T308676)]] (duration: 07m 39s) [20:15:23] T308676: Elasticsearch 7.10.2 rollout plan - https://phabricator.wikimedia.org/T308676 [20:15:29] (03CR) 10Dzahn: [C: 03+1] "lgtm, usually for WMDE it's both LDAP groups, wmde and nda." [puppet] - 10https://gerrit.wikimedia.org/r/832566 (https://phabricator.wikimedia.org/T317637) (owner: 10BCornwall) [20:15:32] ^ inflatador should be live everywhere! [20:15:37] danisztls: you're up [20:15:52] thcipriani: yep [20:16:10] (03CR) 10BCornwall: [C: 03+2] admin: Add Hasan Akgün (haak) to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/832566 (https://phabricator.wikimedia.org/T317637) (owner: 10BCornwall) [20:17:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832526 (https://phabricator.wikimedia.org/T316466) (owner: 10DDesouza) [20:17:48] (03Merged) 10jenkins-bot: Increase coverage of Research Incentive Survey on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832526 (https://phabricator.wikimedia.org/T316466) (owner: 10DDesouza) [20:18:02] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:832526|Increase coverage of Research Incentive Survey on idwiki (T316466)]] [20:18:06] T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466 [20:18:07] thcipriani: thanks! 
[20:18:22] !log thcipriani@deploy1002 thcipriani and dani: Backport for [[gerrit:832526|Increase coverage of Research Incentive Survey on idwiki (T316466)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:18:39] danisztls: ^ live on mwdebug servers, check please [20:19:23] thcipriani: I don't think I can test this change but it looks good on mwdebug [20:20:31] danisztls: gotcha, thanks for checking, going live [20:20:47] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10Dzahn) Even though this ticket only asks for the group "wmde", i am almost certain it should be both "wmde" and "nda" because it always has been for other WMDE em... [20:21:21] (03CR) 10Ryan Kemper: [C: 03+2] opensearch: replace outdated config [puppet] - 10https://gerrit.wikimedia.org/r/826383 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [20:22:43] brett: I'm merging this for ya if that's okay `Brett Cornwall: admin: Add Hasan Akgün (haak) to ldap_only_users (2f5b20e05c)` [20:22:50] (03CR) 10Bking: [V: 03+2 C: 03+1] apifeatureusage: Drop mapping type from template [puppet] - 10https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson) [20:22:52] (03CR) 10Bking: [V: 03+2 C: 03+2] apifeatureusage: Drop mapping type from template [puppet] - 10https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson) [20:23:00] Sure!
Thanks [20:23:06] (done) [20:25:09] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:832526|Increase coverage of Research Incentive Survey on idwiki (T316466)]] (duration: 07m 06s) [20:25:13] T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466 [20:25:25] ^ danisztls should be live now [20:25:30] (03CR) 10Dzahn: [C: 03+1] admin: Add hokwelum to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/832534 (https://phabricator.wikimedia.org/T317545) (owner: 10BCornwall) [20:25:53] thcipriani: thanks [20:27:43] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10BCornwall) I went ahead and clarified that in Wikitech: https://wikitech.wikimedia.org/w/index.php?title=SRE/Clinic_Duty/Access_requests&diff=2012791&oldid=201278... [20:28:12] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Dzahn) You can find more info here on how to configure the ssh client to get to hosts that have private IPs vi... [20:35:32] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Dzahn) @Masssly Hello, I tried to create this list for you and to follow the docs at https://wikitech.wikimedia.org/wiki/Mailman#Create_a_mailing_list But unfortun... [20:37:12] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Dzahn) >>! In T317851#8238507, @Peachey88 wrote: > This appears to already exist, per {T274582} and https://lists.wikimedia.org/postorius/lists/dagbani.lists.wikime... 
[20:39:35] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Dzahn) >>! In T317851#8238655, @Masssly wrote: > The already created public list above was for the entirety of the community. This new ticket is requesting a closed... [20:39:48] mutante: ignore my comment on the task, it doesn't exist, the mailing list request was changed after that comment [20:40:00] PROBLEM - Host db1189.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:40:37] (03PS9) 10Ebernhardson: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) [20:40:42] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Dzahn) @BCornwall Do you wanna try to create it and see if you can repro? There is also the web interface option. [20:41:17] p858snake: oh, ok. thank you. well, maybe I am just reporting a bug with the shell wrapper [20:41:27] see latest comment [20:41:50] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Jclark-ctr) Replaced Failed Memory. [20:42:02] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Jclark-ctr) 05Open→03Resolved [20:43:59] I don't really know why the name is illegal. [20:46:22] RECOVERY - Host db1189.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [20:49:24] mutante: I haven't tried myself but it looks like the create command wants a full name including "@lists.wikimedia.org", not just the part before the @ [20:49:37] ignore me if that's what you were already doing :) [20:50:39] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Masssly) >>! In T317851#8240741, @Dzahn wrote: >>>! 
In T317851#8238655, @Masssly wrote: >> The already created public list above was for the entirety of the communi... [20:50:42] rzl: that was it! it worked. I had not created one ever since mm2 [20:50:47] thank you [20:50:49] \o/ [20:51:32] yea, the old scripts just wanted short name [20:52:46] https://wikitech.wikimedia.org/w/index.php?title=Mailman&diff=2012792&oldid=1988655 for the next person [20:53:14] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Dzahn) @Masssly Please check your email inbox now:) It was that it wanted the full list name "dagbani-kpamba-mgmt@lists.wikimedia.org". [20:53:27] aww. excellent [20:54:08] (03CR) 10Cwhite: Add a cookbook to restart/reboot logstash collector/Kibana nodes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/832447 (owner: 10Muehlenhoff) [20:54:39] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10BCornwall) p:05Triage→03Medium [20:54:53] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Dzahn) I could only specify a single "original admin". Please add the secondary admins using those initial powers.
[20:56:15] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10Dzahn) a:05bking→03pfischer [20:58:57] 10SRE, 10Infrastructure-Foundations: raid_mgmt_tools cannot detect raid on clouddb1021 - https://phabricator.wikimedia.org/T317924 (10colewhite) [20:59:14] 10SRE, 10Infrastructure-Foundations: raid_mgmt_tools cannot detect raid on clouddb1021 - https://phabricator.wikimedia.org/T317924 (10colewhite) [20:59:16] 10Puppet, 10SRE, 10Infrastructure-Foundations: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (10colewhite) [20:59:59] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10BCornwall) a:05BCornwall→03KFrancis [21:00:25] 10Puppet, 10SRE, 10Infrastructure-Foundations: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (10colewhite) 05Open→03Resolved a:03colewhite Rolled back the changes and only one host experienced a regression. Created T317924 to handle that host. [21:00:59] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Masssly) "kpamba" means “management” in Dagbani so it sounds a bit odd to have both terms in the name :) If it's not a lot of work, can we remove the -mgmt part so... 
[21:01:21] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10BCornwall) a:05BCornwall→03HasanAkgun_WMDE [21:01:34] (03PS1) 10Cwhite: smart: remove unused function get_raid_drivers [puppet] - 10https://gerrit.wikimedia.org/r/832570 (https://phabricator.wikimedia.org/T251293) [21:03:04] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Masssly) @3245 As the primary admin, can you please check your email and proceed with the rest of the settings? [21:03:11] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Dzahn) >>! In T317851#8240798, @Masssly wrote: > "kpamba" means “management” in Dagbani so it sounds a bit odd to have both terms in the name :) Oh, thanks for tha... [21:04:16] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Masssly) Thank you very much, Dzahn \o/ [21:04:59] (03PS1) 10Andrew Bogott: Remove hiera for cloudcontrol2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/832571 [21:05:01] (03PS1) 10Andrew Bogott: keystone: refactor fernet key rotation timing [puppet] - 10https://gerrit.wikimedia.org/r/832572 (https://phabricator.wikimedia.org/T317838) [21:08:05] 10SRE, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10Novem_Linguae) In general, shouldn't phabricator tickets be one ticket = one cause? This one seems like it may be one ticket = many causes since it's suc... [21:08:34] 10SRE, 10Wikimedia-Mailing-lists: Create non-public Dagbani Wikimedians Usergroup Team mailing list - https://phabricator.wikimedia.org/T317851 (10Dzahn) 05Open→03Resolved a:03Dzahn You are welcome! 
for completeness Removed list: dagbani-kpamba-mgmt@lists.wikimedia.org [21:09:26] (03CR) 10BCornwall: [C: 03+2] admin: Add hokwelum to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/832534 (https://phabricator.wikimedia.org/T317545) (owner: 10BCornwall) [21:10:10] (03PS2) 10BCornwall: admin: Add hokwelum to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/832534 (https://phabricator.wikimedia.org/T317545) [21:16:13] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10KFrancis) @Dzahn I do have an unrelated question. I will be on vacation from Sept. 19-30 and need to change my delegation for the NDA workflow while I'm out. Do you know how to do th... [21:16:29] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (10BCornwall) Okay, @Milimetric, that should be solved. She's in analytics-privatedata-users now. Would you say that's everything needed? 
[21:16:56] (03PS2) 10Andrew Bogott: Remove hiera for cloudcontrol2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/832571 [21:16:58] (03PS2) 10Andrew Bogott: keystone: refactor fernet key rotation timing [puppet] - 10https://gerrit.wikimedia.org/r/832572 (https://phabricator.wikimedia.org/T317838) [21:17:00] (03PS1) 10Andrew Bogott: dynamicproxy: remove a comment about the log rotation happening hourly [puppet] - 10https://gerrit.wikimedia.org/r/832573 [21:18:58] (03PS3) 10Andrew Bogott: keystone: refactor fernet key rotation timing [puppet] - 10https://gerrit.wikimedia.org/r/832572 (https://phabricator.wikimedia.org/T317838) [21:19:07] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: remove a comment about the log rotation happening hourly [puppet] - 10https://gerrit.wikimedia.org/r/832573 (owner: 10Andrew Bogott) [21:19:19] (03CR) 10Andrew Bogott: [C: 03+2] Remove hiera for cloudcontrol2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/832571 (owner: 10Andrew Bogott) [21:20:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:22:50] (03CR) 10Andrew Bogott: [C: 03+2] keystone: refactor fernet key rotation timing [puppet] - 10https://gerrit.wikimedia.org/r/832572 (https://phabricator.wikimedia.org/T317838) (owner: 10Andrew Bogott) [21:24:48] (03CR) 10Hashar: phabricator: remove Icinga monitoring for phd supervising processes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832368 (https://phabricator.wikimedia.org/T315962) (owner: 10Dzahn) [21:25:52] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10RhinosF1) >>! In T317637#8240849, @KFrancis wrote: > @Dzahn I do have an unrelated question. 
I will be on vacation from Sept. 19-30 and need to change my delegation for the NDA workfl... [21:25:58] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (GET configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:30:12] !log depool wcqs2001 for T316236 [21:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:16] T316236: Reload WCQS from dumps - https://phabricator.wikimedia.org/T316236 [21:38:02] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:11] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10Dzahn) Hi @KFrancis :) There are 2 (or more) ways we can go about it. One is what RhinosF1 said, one can look up who is on clinic duty for specific dates. The schedule is at https:/... [21:42:48] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10Dzahn) @KFrancis And yet another way is we can ask ITS via techsupport@ to create a Google mail alias, something like nda-requests@ and then you get to control what that forwards to yo...
[22:01:20] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs2001.codfw.wmnet with reason: T316236 [22:01:24] T316236: Reload WCQS from dumps - https://phabricator.wikimedia.org/T316236 [22:01:45] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs2001.codfw.wmnet with reason: T316236 [22:03:10] RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 0 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:03:34] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10KFrancis) @BBlack and @JMeybohm -If you receive a request for an NDA between September 19-30, 2022, please contact Rachel Stallman at @RStallman-legalteam for processing as I will be O... [22:03:36] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:05:40] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:06:02] 10SRE, 10Observability-Alerting, 10observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (10Dzahn) After pondering this a bit more I now think the _actual fix_ would be if Wikipedia and other projects just also fix the same punctuation issue that Wikibooks fixed... 
[22:07:02] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:27:26] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10Dzahn) [22:27:57] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Dzahn) 05Open→03Resolved >>! In T316090#8216642, @pfischer wrote: > @Gehel, looks good to me, I'm at least... [22:33:52] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Dzahn) here is an example for a wdqs server: FQDN: wdqs1009.eqiad.wmnet [wdqs1009:~] $ id pfischer uid=41223... [22:40:29] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10Dzahn) >>! In T317637#8240688, @BCornwall wrote: > I went ahead and clarified that in Wikitech: https://wikitech.wikimedia.org/w/index.php?title=SRE/Clinic_Duty/Access_requests&diff=20... 
[22:40:50] RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:41:23] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10Dzahn) 05In progress→03Resolved a:05HasanAkgun_WMDE→03BCornwall [22:42:13] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (10Dzahn) a:05BCornwall→03None [22:44:00] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10KFrancis) @BCornwall Please provide Tanuja's wmde email address and I'll process this. Thanks! [22:44:29] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10KFrancis) Please send to kfrancis@wikimedia.org [22:46:32] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [22:47:01] 10SRE, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10Zabe) It was now also implemented that members of ldap/wmde should be added to #wmf-nda, see https://wikitech.wikimedia.org/w/index.php?title=SRE/Clinic_Duty/Access_requests... [22:47:23] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10Dzahn) >>! 
In T317613#8240974, @KFrancis wrote: > Please send to kfrancis@wikimedia.org just mailed it to you [22:57:58] (03CR) 10Dzahn: "ah, yea, with the new path I can confirm the situation is like this:" [puppet] - 10https://gerrit.wikimedia.org/r/832344 (owner: 10Hashar) [23:03:23] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10KFrancis) Thanks so much! The agreement is out for signatures. [23:04:38] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:05:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:07:00] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:13:28] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:18:02] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:21:30] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 
1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:50:54] (03CR) 10Dzahn: [C: 03+2] gerrit: rm redundant service_params ensure => running [puppet] - 10https://gerrit.wikimedia.org/r/832411 (owner: 10Hashar) [23:51:00] (03PS3) 10Dzahn: gerrit: rm redundant service_params ensure => running [puppet] - 10https://gerrit.wikimedia.org/r/832411 (owner: 10Hashar) [23:51:21] !log gerrit1001 - disabled puppet - gerrit:832411 [23:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:31] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:54:28] (03CR) 10Dzahn: "deployed first gerrit2001 then gerrit1001 - noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/832411 (owner: 10Hashar) [23:54:54] (03CR) 10Dzahn: [C: 03+2] "2002 of course, 2001 is no more" [puppet] - 10https://gerrit.wikimedia.org/r/832411 (owner: 10Hashar)