[00:02:09] ^ for fun, count how many robots are talking to each other, when that happens [00:08:26] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:23:58] (03CR) 10Jforrester: "Oops. Thank you for this." [core] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709789 (https://phabricator.wikimedia.org/T287988) (owner: 10Urbanecm) [00:48:21] (03CR) 10Ottomata: [C: 03+1] analytics web: create htdocs subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/709822 (owner: 10Ryan Kemper) [01:10:00] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:26:04] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10RLazarus) FWIW, this error message comes from En... [02:11:38] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 6 (dbprov1001, ...), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [02:45:34] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10RandomCanadian) Safe-mode also prevents usage of "include" in the lilypond (necessary for some stuff like gr... 
[02:48:24] (03PS1) 10Tim Starling: Update cli.inc for renamed core commandLine.inc [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709795 [02:48:38] (03PS1) 10Tim Starling: Update cli.inc for renamed core commandLine.inc [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709796 [02:48:59] (03CR) 10Tim Starling: [C: 03+2] Update cli.inc for renamed core commandLine.inc [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709795 (owner: 10Tim Starling) [02:49:12] (03CR) 10Tim Starling: [C: 03+2] Update cli.inc for renamed core commandLine.inc [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709796 (owner: 10Tim Starling) [02:54:01] (03Merged) 10jenkins-bot: Update cli.inc for renamed core commandLine.inc [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709795 (owner: 10Tim Starling) [02:54:03] (03Merged) 10jenkins-bot: Update cli.inc for renamed core commandLine.inc [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709796 (owner: 10Tim Starling) [03:11:06] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:11:44] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:27:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [03:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:34:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[03:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:53] (03PS1) 10Tim Starling: Add scripts for 2021 voter qualification [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709797 [03:44:59] (03CR) 10Tim Starling: [C: 03+2] Add scripts for 2021 voter qualification [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709797 (owner: 10Tim Starling) [03:45:21] (03PS1) 10Tim Starling: Add scripts for 2021 voter qualification [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709798 [03:45:27] (03CR) 10Tim Starling: [C: 03+2] Add scripts for 2021 voter qualification [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709798 (owner: 10Tim Starling) [03:49:22] (03Merged) 10jenkins-bot: Add scripts for 2021 voter qualification [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709797 (owner: 10Tim Starling) [03:49:24] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:49:49] (03Merged) 10jenkins-bot: Add scripts for 2021 voter qualification [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709798 (owner: 10Tim Starling) [03:50:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[03:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:20] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:56:08] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/SecurePoll: for bv2021/populateEditCount.php (duration: 01m 18s) [03:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [03:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:22] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/SecurePoll: for bv2021/populateEditCount.php (duration: 01m 06s) [03:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:03] !log on mwmaint2002: creating bv2021_edits table on all wikis [04:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:22] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:14:18] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:22] PROBLEM - Disk space on releases1002 is CRITICAL: DISK CRITICAL - free space: /srv/docker 1023 MB (0% inode=65%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops [04:26:41] 10ops-eqiad, 10DBA: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T287137 (10Marostegui) 05Open→03Resolved This is all good now: ` root@db1175:~# megacli -LDInfo -Lall -aALL Adapter 0 -- 
Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : P... [04:34:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1174 to clone db1127 T286763', diff saved to https://phabricator.wikimedia.org/P16948 and previous config saved to /var/cache/conftool/dbconfig/20210804-043438-marostegui.json [04:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:47] T286763: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 [04:38:28] 10SRE, 10ops-eqiad, 10DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10Marostegui) I am cloning db1127 from db1174 [04:41:01] 10SRE, 10Release-Engineering-Team: releases1002 /srv/docker DISK SPACE alert - https://phabricator.wikimedia.org/T288024 (10Marostegui) [04:45:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3312 to clone db1170:3312 T286888', diff saved to https://phabricator.wikimedia.org/P16950 and previous config saved to /var/cache/conftool/dbconfig/20210804-044507-marostegui.json [04:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:45:15] T286888: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 [04:47:47] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Marostegui) I am cloning db1170:3312 from db1105:3312 s7 part needs to wait as: ` root@cumin1001:~# dbctl instance db1101:3317 depool Execution FAILED Reported errors: Section s7 is supposed to... 
[04:54:08] !log on mwmaint2002: running bv2021/populateEditCounts.php on all wikis with one thread per section s1-s8 [04:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:20] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [05:07:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311', diff saved to https://phabricator.wikimedia.org/P16952 and previous config saved to /var/cache/conftool/dbconfig/20210804-050751-marostegui.json [05:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:08] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [05:11:37] (03PS1) 10Marostegui: production-m5.sql.erb: Add toolhub grants [puppet] - 10https://gerrit.wikimedia.org/r/709877 (https://phabricator.wikimedia.org/T271480) [05:35:01] (03CR) 10Muehlenhoff: [C: 03+2] Add DHCP entry for testvm2002, running on ganeti-test01 [puppet] - 10https://gerrit.wikimedia.org/r/709736 (owner: 10Muehlenhoff) [05:35:16] !log docker image prune on releases1002, T288024 [05:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:23] T288024: releases1002 /srv/docker DISK SPACE alert - https://phabricator.wikimedia.org/T288024 [05:36:41] 10SRE, 10Analytics: Import the openjdk8 packages in Bullseye - https://phabricator.wikimedia.org/T287960 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff Sure thing, I'll take care of this next week. [05:38:31] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10MoritzMuehlenhoff) p:05Triage→03Low [05:39:14] 10SRE, 10Release-Engineering-Team: releases1002 /srv/docker DISK SPACE alert - https://phabricator.wikimedia.org/T288024 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03hashar Antoine, could you please have a look whether we can free something? 
[05:43:58] RECOVERY - Disk space on releases1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops [05:49:43] 10SRE, 10observability: Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10MoritzMuehlenhoff) [05:54:10] 10SRE, 10Release-Engineering-Team: releases1002 /srv/docker DISK SPACE alert - https://phabricator.wikimedia.org/T288024 (10Joe) I've done a `docker image prune -a` on that server, but I think we will need to give it a larger docker partition given the amount of images we're building there. [05:56:20] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10Ladsgroup) This might be helpful: {T113114} I th... [05:56:21] 10SRE, 10observability: Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10Marostegui) +1 to this. I only found this useful in a case were we offboarded someone and I found a screen (with no activity) days after, but this was a looooong time ago. [05:59:05] 10SRE, 10observability: Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10Joe) +1 I don't remember it being useful once for me, while annoying me plenty of times. And now we have the logout cookbook too, so even during offboarding it's not really useful. 
[06:03:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1170:3312, db1105:3312, db1105:3311 T286888', diff saved to https://phabricator.wikimedia.org/P16953 and previous config saved to /var/cache/conftool/dbconfig/20210804-060347-marostegui.json [06:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:55] T286888: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 [06:03:59] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Marostegui) db1170:3312 is now up and running with GTID enabled and repooled Will start db1170:3317 as soon as the other s7 transfer is done. [06:15:50] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10MoritzMuehlenhoff) >>! In T286776#7256015, @KFrancis wrote: > @RLazarus I am confirming the NDA has been signed. Please proceed with the access request. Thanks! Thanks @KFra... [06:17:17] 10SRE, 10LDAP-Access-Requests: LDAP Access Request for WMDE Employee - Elena Aleynikova - https://phabricator.wikimedia.org/T286776 (10MoritzMuehlenhoff) 05Open→03Resolved @elal : I've added you to the cn=nda and cn=wmde LDAP groups. You should now be able to access Superset. If you run into any issues, pl... [06:25:26] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [06:28:17] (03PS2) 10Ladsgroup: Add shellbox-constraint services and use them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) [06:36:39] dcausse: o/ if you are around, an-airflow1001 needs some disk cleanup [06:36:43] (otherwise I can do it) [06:38:45] moritzm: docker filing releases is known issue. 
Some Jenkins job does not reclaim intermediate layers/containers. Luckily I got it on a dedicated partition [06:39:00] Thx for the ping! [06:39:35] elukey: looking, thanks for the heads up! [06:39:46] 10SRE, 10ops-eqiad, 10DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10Marostegui) Cloned - waiting for replication to catch up [06:43:03] (03PS1) 10Marostegui: Revert "db1127: Disable notificactions." [puppet] - 10https://gerrit.wikimedia.org/r/709799 [06:43:26] hashar: ack! do we have a task for the underlying issue? if there's no immediate fix on the Jenkins side we should add a systemd timer to trigger a cleanup before this escalates to alerts [06:43:43] (03CR) 10Marostegui: [C: 03+2] Revert "db1127: Disable notificactions." [puppet] - 10https://gerrit.wikimedia.org/r/709799 (owner: 10Marostegui) [06:45:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1174 and db1127 T286763', diff saved to https://phabricator.wikimedia.org/P16954 and previous config saved to /var/cache/conftool/dbconfig/20210804-064548-marostegui.json [06:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:56] T286763: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 [06:46:11] 10SRE, 10ops-eqiad, 10DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10Marostegui) Host pooled, GTID enabled, notifications enabled. All sorted. 
[06:48:55] (03CR) 10Ladsgroup: [C: 03+1] Remove DynamicPageList from all Wikimania wikis except 2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709585 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [06:53:30] !log installing testvm2002 T286206 [06:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:37] T286206: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 [07:02:26] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10ayounsi) Still nothing on console. Can you also make sure mgmt is connected? [07:16:12] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:23:45] moritzm: we had the same issue earlier this week, last week but it just got manually fixed. So I guess we can use today task to track the proper fix [07:24:06] (03PS1) 10Muehlenhoff: Add elal to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/709941 (https://phabricator.wikimedia.org/T286776) [07:24:10] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:25:42] hashar: ok [07:26:00] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10ema) >>! In T287983#7257682, @RLazarus wrote: >... 
[07:26:08] (03CR) 10Muehlenhoff: [C: 03+2] Add elal to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/709941 (https://phabricator.wikimedia.org/T286776) (owner: 10Muehlenhoff) [07:27:08] (03PS1) 10Filippo Giunchedi: pontoon: fix path to curl [puppet] - 10https://gerrit.wikimedia.org/r/709942 [07:27:17] 10SRE, 10Release-Engineering-Team: releases1002 /srv/docker DISK SPACE alert - https://phabricator.wikimedia.org/T288024 (10hashar) [07:28:44] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: fix path to curl [puppet] - 10https://gerrit.wikimedia.org/r/709942 (owner: 10Filippo Giunchedi) [07:29:06] 10SRE, 10Release-Engineering-Team: releases1002 /srv/docker DISK SPACE alert - https://phabricator.wikimedia.org/T288024 (10hashar) a:05hashar→03None That is routinely filing up due to some Jenkins job creating images/containers but not reclaiming them at end of build. @dancy / @dduvall will know the detai... [07:40:43] (03PS1) 10Filippo Giunchedi: netops: temporarily skip externallabels in alerts [alerts] - 10https://gerrit.wikimedia.org/r/709944 [07:42:20] 10SRE, 10Release-Engineering-Team: releases1002 /srv/docker DISK SPACE alert - https://phabricator.wikimedia.org/T288024 (10MoritzMuehlenhoff) If there's no immediate fix on the Jenkins side we should add a systemd timer to trigger a cleanup before this escalates to alerts [07:43:57] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10elukey) I have started https://wikitech.wikimedia.org/wiki/Kafka/Administration#Rebalance_... [07:44:21] XioNoX: FYI https://gerrit.wikimedia.org/r/c/operations/alerts/+/709944/ [07:44:33] (03CR) 10Filippo Giunchedi: [C: 03+2] netops: temporarily skip externallabels in alerts [alerts] - 10https://gerrit.wikimedia.org/r/709944 (owner: 10Filippo Giunchedi) [07:45:27] ok! 
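The releases1002 thread above proposes a periodic cleanup that runs before the /srv/docker partition fills up and alerts. A minimal sketch of that idea follows, assuming a script that a systemd timer would invoke on a schedule. The 10% threshold and the function names `should_prune` and `prune_if_needed` are hypothetical and do not come from the log; `docker image prune -a` is the command mentioned in the task, and `-f` is the standard Docker flag that skips the confirmation prompt.

```python
import shutil
import subprocess

# Hypothetical threshold: prune when less than 10% of the partition is free.
MIN_FREE_FRACTION = 0.10


def should_prune(free_bytes: int, total_bytes: int,
                 min_free_fraction: float = MIN_FREE_FRACTION) -> bool:
    """Return True when free space has dropped below the given fraction."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def prune_if_needed(path: str = "/srv/docker") -> bool:
    """Check the partition and run the prune only when it is nearly full."""
    usage = shutil.disk_usage(path)
    if not should_prune(usage.free, usage.total):
        return False
    # -a removes all unused images, -f skips the interactive confirmation.
    subprocess.run(["docker", "image", "prune", "-a", "-f"], check=True)
    return True
```

Keeping the threshold decision in a pure function makes it easy to test without touching Docker. As noted in the task, this would only be a stopgap if the Jenkins job that leaves images behind cannot be fixed directly.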
[07:46:27] (03PS1) 10Filippo Giunchedi: alerts: run pytest as needed [puppet] - 10https://gerrit.wikimedia.org/r/709945 (https://phabricator.wikimedia.org/T284810) [07:46:29] (03PS1) 10Filippo Giunchedi: alerts: fix glob selection logic [puppet] - 10https://gerrit.wikimedia.org/r/709946 (https://phabricator.wikimedia.org/T284810) [07:51:06] seeking kind souls for two quick reviews ^ [07:51:31] looking [07:51:50] thank you moritzm, appreciate it [07:53:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/709945 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [07:53:46] 10SRE, 10observability: Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10fgiunchedi) I'm also +1 on ditching the alert [07:54:22] do you also get misalignment of the circles around names in updated gerrit? it is making me twitch [07:55:18] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10ema) >>! In T254317#7255820, @elukey wrote: > In theory a lot of `tls = '-'` should be redirects from http to https, that hit Varnish and... 
[07:55:27] https://phabricator.wikimedia.org/F34575510 that is [07:55:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/709946 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [07:57:00] thanks moritzm [07:57:05] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: run pytest as needed [puppet] - 10https://gerrit.wikimedia.org/r/709945 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [07:57:13] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: fix glob selection logic [puppet] - 10https://gerrit.wikimedia.org/r/709946 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [08:00:53] !log upgrade prometheus1003 - T222113 [08:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:02] T222113: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 [08:01:08] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10elukey) A http to https redirect is probably not really a webrequest (following https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Tr... 
[08:04:07] (03PS9) 10JMeybohm: Add a temporary role for appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) [08:05:53] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30473/console" [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:06:48] (03PS2) 10Legoktm: Add shellbox-constraints to LVS [puppet] - 10https://gerrit.wikimedia.org/r/709566 (https://phabricator.wikimedia.org/T285104) [08:06:50] (03PS2) 10Legoktm: service: Switch shellbox-constraints to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/709567 (https://phabricator.wikimedia.org/T285104) [08:06:52] (03PS2) 10Legoktm: service: Switch shellbox-constraints to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/709568 (https://phabricator.wikimedia.org/T285104) [08:06:54] (03PS2) 10Legoktm: service: Switch shellbox-constraints to production [puppet] - 10https://gerrit.wikimedia.org/r/709569 (https://phabricator.wikimedia.org/T285104) [08:08:21] vgutierrez: OK to merge the first patch? "Add shellbox-constraints to LVS" [08:08:31] please go ahead [08:08:56] (03CR) 10Legoktm: [C: 03+2] Add shellbox-constraints to LVS [puppet] - 10https://gerrit.wikimedia.org/r/709566 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [08:09:47] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:10:21] (03PS1) 10MVernon: Packaging: add Depends on curl [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709949 [08:10:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM but please have a plan for cleanup 😄" [puppet] - 10https://gerrit.wikimedia.org/r/709740 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:10:34] done, ready for moving to lvs_setup? 
[08:10:50] 10SRE, 10observability: Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10jcrespo) > If I remember correctly, originally it was introduced to detect/prevent cases where a recurring DB maintenance tasks was running in a user's screen session continuously I don't... [08:11:06] legoktm: yep, lvs_setup CR shows the expected changes https://puppet-compiler.wmflabs.org/compiler1002/30474/ [08:11:08] yw :-) [08:11:36] PROBLEM - Prometheus k8s cache not updating on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [08:11:42] PROBLEM - Prometheus k8s-mlserve cache not updating on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [08:11:48] PROBLEM - Prometheus k8s-staging cache not updating on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [08:12:23] checking, kinda expected [08:12:33] (03PS3) 10JMeybohm: site: Switch a bunch of eqiad appservers to appserver_dragonfly role [puppet] - 10https://gerrit.wikimedia.org/r/709740 (https://phabricator.wikimedia.org/T286054) [08:12:56] the backup LVS servers are lvs1016 and lvs2010 right? 
[08:12:57] legoktm: indeed [08:13:07] and active are lvs1015, lvs2009 [08:13:07] ok [08:13:12] that's right [08:13:33] (03PS1) 10MVernon: wmf-update-ssh-config: add option to skip systemd activation [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709950 [08:13:46] (03CR) 10Vgutierrez: [C: 03+1] service: Switch shellbox-constraints to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/709567 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [08:13:51] (03CR) 10Legoktm: [C: 03+2] service: Switch shellbox-constraints to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/709567 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [08:14:26] running puppet on all LVS servers now [08:15:11] ok, even you'd be safe just hitting a puppet run on 1015,1016,2009 and 2010 :) [08:15:24] RECOVERY - Prometheus k8s cache not updating on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [08:15:28] RECOVERY - Prometheus k8s-mlserve cache not updating on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [08:15:31] https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers says to do it everywhere [08:15:36] RECOVERY - Prometheus k8s-staging cache not updating on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [08:15:41] legoktm: sure, no problem with that [08:16:20] legoktm: it will be a NOOP (regarding that CR) on high-traffic1 and high-traffic2 load balancers [08:16:25] gotcha [08:16:49] your puppet run will trigger some icinga noise [08:16:59] yeah, I'll ack those when they pop up [08:17:24] OK to restart pybal on lvs1016? [08:17:27] PyBal IPVS diff check won't be happy [08:17:32] legoktm: yes, go ahead [08:18:29] !log restarting pybal on lvs1016 to add shellbox-constraints service [08:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:46] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.61:4010]) https://wikitech.wikimedia.org/wiki/PyBal [08:18:56] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 59 connections established with conf2004.codfw.wmnet:4001 (min=60) https://wikitech.wikimedia.org/wiki/PyBal [08:19:11] BGP back on lvs1016 :) [08:19:53] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 59 connections established with conf2004.codfw.wmnet:4001 (min=60) Legoktm Deploying new service (shellbox-constraints) https://wikitech.wikimedia.org/wiki/PyBal [08:20:03] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.61:4010]) Legoktm Deploying new service (shellbox-constraints) https://wikitech.wikimedia.org/wiki/PyBal [08:20:32] `journalctl -u pybal --since today | grep shellbox` shows it as well [08:20:38] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.61:4010]) https://wikitech.wikimedia.org/wiki/PyBal [08:20:38] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Add a temporary role for 
appservers plus docker and dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/709719 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:20:50] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.61:4010]) https://wikitech.wikimedia.org/wiki/PyBal [08:20:51] legoktm: curl http://localhost:9090/pools/shellbox-constraints_4010 [08:21:01] you can see there all the appservers for the new service up and pooled :) [08:21:15] oooh [08:21:40] 10SRE, 10Infrastructure-Foundations, 10netops: Create an alert for output discards on network devices - https://phabricator.wikimedia.org/T284593 (10ayounsi) Current values for `ifOutDiscards_delta`: > 'asw2-b-eqiad.mgmt.eqiad.wmnet', 'xe-7/0/41', '509' > 'asw2-b-eqiad.mgmt.eqiad.wmnet', 'xe-2/0/41', '1601'... [08:21:41] GET against /pools will give you the list of available pools on that pybal instance BTW [08:21:46] ok, ready to restart on lvs2010 now? [08:21:51] yes [08:22:08] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.61:4010]) Legoktm Deploying new service (shellbox-constraints) https://wikitech.wikimedia.org/wiki/PyBal [08:22:08] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.61:4010]) Legoktm Deploying new service (shellbox-constraints) https://wikitech.wikimedia.org/wiki/PyBal [08:22:14] (03PS4) 10JMeybohm: site: Switch a bunch of eqiad appservers to appserver_dragonfly role [puppet] - 10https://gerrit.wikimedia.org/r/709740 (https://phabricator.wikimedia.org/T286054) [08:22:22] !log restarting pybal on lvs2010 to add shellbox-constraints service [08:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:06] BGP up in lvs2010, and shellbox-constraints pool is happy :) [08:23:38] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 
67 connections established with conf1004.eqiad.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal [08:24:02] great, ready for lvs1015 now? [08:24:04] 10SRE, 10Analytics, 10Traffic, 10Patch-For-Review: Compare logs produced by atskfafka with those produced by varnishkafka - https://phabricator.wikimedia.org/T254317 (10elukey) I had a chat with Ema on IRC, reporting a summary: * At the current state of the TLS termination layer, it is likely that ATS-TLS... [08:24:12] legoktm: yep [08:24:38] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:24:54] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:24:55] !log restarting pybal on lvs1015 to add shellbox-constraints service [08:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:12] (03PS3) 10Volans: sre.ganeti.makevm: make error message more explicit [cookbooks] - 10https://gerrit.wikimedia.org/r/709706 [08:25:49] legoktm: all good in lvs1015 as well :) [08:26:20] great, set for lvs2009 now? [08:26:32] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:26:45] legoktm: sure [08:27:00] damn logmsgbot messes my nick completion :/ [08:27:10] !log restarting pybal on lvs2009 to add shellbox-constraints service [08:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:01] legoktm: nice :D, BGP running on lvs2009 and shellbox-constraints looking happy there as well [08:29:03] woot [08:29:08] onto monitoring_setup now? 
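The pybal restarts above follow a fixed order: the backup load balancers first (lvs1016, lvs2010), then the active ones (lvs1015, lvs2009), with a confirmation that BGP is back and the new pool looks healthy before moving to the next host. That sequencing could be sketched as below. The `restart` and `is_healthy` callables are placeholders for whatever actually restarts pybal and checks BGP and pool state; they are assumptions for illustration, not real tooling from the log.

```python
from typing import Callable, Iterable, List

# Host roles as stated in the conversation above.
BACKUP_LVS = ["lvs1016", "lvs2010"]
ACTIVE_LVS = ["lvs1015", "lvs2009"]


def rolling_restart(backups: Iterable[str], actives: Iterable[str],
                    restart: Callable[[str], None],
                    is_healthy: Callable[[str], bool]) -> List[str]:
    """Restart backups first, then actives, stopping at the first unhealthy host."""
    done: List[str] = []
    for host in [*backups, *actives]:
        restart(host)
        # Do not touch the next host until this one is confirmed healthy.
        if not is_healthy(host):
            raise RuntimeError(f"{host} not healthy after restart; stopping")
        done.append(host)
    return done


# Dry run with stub callables: record the restart order, treat every host as healthy.
restarted: List[str] = []
order = rolling_restart(BACKUP_LVS, ACTIVE_LVS, restarted.append, lambda host: True)
print(order)
```

Restarting the backups first means that a bad configuration shows up on hosts that are not carrying traffic, which matches the order used in the log.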
[08:29:22] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [08:29:30] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 68 connections established with conf1004.eqiad.wmnet:4001 (min=68) https://wikitech.wikimedia.org/wiki/PyBal [08:29:49] legoktm: go ahead [08:29:53] hm, the last thing on that step is to test with curl [08:30:10] but I didn't merge https://gerrit.wikimedia.org/r/c/operations/dns/+/709571 yet - maybe I could've done that already? [08:30:42] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 60 connections established with conf2004.codfw.wmnet:4001 (min=60) https://wikitech.wikimedia.org/wiki/PyBal [08:31:37] legoktm: yeah, that one can be merged already [08:31:52] ok, let me do that first so we can test it works [08:32:01] (03PS5) 10JMeybohm: site: Switch a bunch of eqiad appservers to appserver_dragonfly role [puppet] - 10https://gerrit.wikimedia.org/r/709740 (https://phabricator.wikimedia.org/T286054) [08:32:03] (03PS1) 10JMeybohm: mediawiki::appserver_dragonfly: Fix docker package name [puppet] - 10https://gerrit.wikimedia.org/r/709952 (https://phabricator.wikimedia.org/T286054) [08:32:18] at DNS level the one that requires a specific service state is https://gerrit.wikimedia.org/r/c/operations/dns/+/709572/ [08:32:28] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [08:32:31] (03PS2) 10Legoktm: Add shellbox-constraints.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/709571 (https://phabricator.wikimedia.org/T285104) [08:32:33] (03PS2) 10Legoktm: Add shellbox-constraints to discovery [dns] - 10https://gerrit.wikimedia.org/r/709572 (https://phabricator.wikimedia.org/T285104) [08:32:36] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [08:33:04] (03CR) 10Legoktm: [C: 03+2] Add shellbox-constraints.svc.{codfw,eqiad}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/709571 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [08:34:14] (03CR) 10JMeybohm: [C: 03+2] mediawiki::appserver_dragonfly: Fix docker package name [puppet] - 10https://gerrit.wikimedia.org/r/709952 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [08:35:56] sweet, `curl https://shellbox-constraints.svc.codfw.wmnet:4010/healthz` works now :D [08:35:56] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.9 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/709953 [08:36:09] so does eqiad [08:36:37] vgutierrez: ok, now moving to monitoring_setup [08:36:42] (03CR) 10Legoktm: [C: 03+2] service: Switch shellbox-constraints to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/709568 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [08:38:33] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10elukey) Thanos graphs for topics with more than 0 msg/s for: - [[ https://thanos.wikimedi... [08:39:38] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.9 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/709953 (owner: 10Volans) [08:41:19] 10SRE: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (10JMeybohm) I've just seen this on mw1384 while installing dragonfly-dfdaemon. 
[08:41:34] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [08:41:54] !log pool prometheus1003 (and depool prometheus1004 for testing 1003 only) - T222113 [08:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:01] T222113: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 [08:42:08] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.9 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/709953 (owner: 10Volans) [08:42:54] (03CR) 10Kormat: [C: 03+1] mediawiki: Remove old 'parser_cache_purging' job [puppet] - 10https://gerrit.wikimedia.org/r/702427 (owner: 10Krinkle) [08:43:03] legoktm: checks marked as UP in icinga, nice [08:43:05] monitoring in icinga looks good to me, [08:43:39] time for switching to "production"? [08:44:13] it looks like it :) [08:44:37] (03CR) 10Legoktm: [C: 03+2] service: Switch shellbox-constraints to production [puppet] - 10https://gerrit.wikimedia.org/r/709569 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [08:44:43] (03PS3) 10Legoktm: service: Switch shellbox-constraints to production [puppet] - 10https://gerrit.wikimedia.org/r/709569 (https://phabricator.wikimedia.org/T285104) [08:44:52] 10SRE, 10observability: Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:46:41] (03PS1) 10JMeybohm: Revert "Create dragonfly user via systemd-sysusers" [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/709803 [08:47:05] (03PS1) 10Volans: Upstream release v0.0.9 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/709955 [08:47:22] (03PS1) 10Btullis: Failover analytics-hive service to standby server [dns] - 10https://gerrit.wikimedia.org/r/709956 (https://phabricator.wikimedia.org/T279304) [08:47:24] 10SRE, 10observability: Remove the "Long running 
screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10jcrespo) I dug deeper and the causes seems to be a 2017 incident mentioned on the meeting notes as: > screen "api-hhvm-restarts" on neodymium restarted a bunch of api servers on Fri (scree... [08:47:30] 10SRE, 10Platform Engineering, 10Traffic, 10Wikimedia Enterprise (Okapi Wikimedia Enterprise): Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams - https://phabricator.wikimedia.org/T280628 (10AnnaMikla) [08:48:40] PROBLEM - DPKG on mw1384 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:51:52] vgutierrez: ok, ran puppet against A:icinga or A:dns-auth. Do I need a sudo authdns-update too? Or am I all set to merge the final DNS change? [08:51:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet [08:52:03] https://gerrit.wikimedia.org/r/c/operations/dns/+/709572 specifically [08:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:22] 10SRE, 10Traffic, 10Patch-For-Review, 10Wikimedia Enterprise (Okapi Wikimedia Enterprise): "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10AnnaMikla) [08:52:52] legoktm: merge the DNS change and then authdns-update :) [08:53:11] (03CR) 10Legoktm: [C: 03+2] Add shellbox-constraints to discovery [dns] - 10https://gerrit.wikimedia.org/r/709572 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [08:53:27] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.9 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/709955 (owner: 10Volans) [08:53:35] * legoktm does [08:54:05] (03PS1) 10Filippo Giunchedi: pontoon: allow access from DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/709957 [08:54:32] (03PS2) 10Btullis: Failover analytics-hive service to standby server [dns] - 10https://gerrit.wikimedia.org/r/709956 
(https://phabricator.wikimedia.org/T279304) [08:55:18] !log legoktm@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=shellbox-constraints [08:55:20] legoktm: FYI I'm editing https://netbox.wikimedia.org/ipam/ip-addresses/8845/ to set the netmask to /32 as it's a VIP [08:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:53] volans: thanks, not sure how I missed that [08:56:00] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: allow access from DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/709957 (owner: 10Filippo Giunchedi) [08:56:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=minio site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:56:15] no worries, it defaults to the prefix's netmask [08:56:23] do I need to do anything else for that change to take effect? [08:56:33] no [08:56:48] all good, thx [08:56:55] (03Merged) 10jenkins-bot: Upstream release v0.0.9 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/709955 (owner: 10Volans) [08:57:36] vgutierrez: discovery DNS works, I think we're all set now? [08:57:38] volans: nice catch, I checked netbox but I didn't notice the netmask :( [08:57:48] legoktm: that's right [08:58:06] we should add a report to check that [08:58:44] vgutierrez: then thank you :)) [08:59:00] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Bawolff) Just to summarize some additional investigation that was done: * The triggering... 
[08:59:10] thanks for choosing the traffic team edge networking services 🍻 [09:01:39] (03CR) 10Btullis: [C: 03+2] Failover analytics-hive service to standby server [dns] - 10https://gerrit.wikimedia.org/r/709956 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [09:02:17] (03PS1) 10Legoktm: services_proxy: Add envoyproxy for shellbox-constraints [puppet] - 10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) [09:03:34] (03CR) 10Legoktm: services_proxy: Add envoyproxy for shellbox-constraints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [09:04:32] (03CR) 10Legoktm: [C: 04-1] Add shellbox-constraint services and use them (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [09:05:28] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Legoktm) [09:08:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (10dcaro) [09:09:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (10dcaro) [09:10:15] !log uploaded python3-wmflib_0.0.9 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [09:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2001.codfw.wmnet [09:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:32] (03CR) 10Ladsgroup: services_proxy: Add envoyproxy for shellbox-constraints (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [09:13:40] (03PS1) 10Btullis: Failback analytics-hive to the primary server [dns] - 10https://gerrit.wikimedia.org/r/709962 (https://phabricator.wikimedia.org/T279304) [09:14:13] (03PS1) 10Muehlenhoff: Add testvm2001 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/709963 [09:14:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709949 (owner: 10MVernon) [09:15:25] (03PS1) 10Ema: varnish: remove Resp time from internal SLI [puppet] - 10https://gerrit.wikimedia.org/r/709964 (https://phabricator.wikimedia.org/T284576) [09:17:30] (03CR) 10Ema: [C: 03+2] varnish: remove Resp time from internal SLI [puppet] - 10https://gerrit.wikimedia.org/r/709964 (https://phabricator.wikimedia.org/T284576) (owner: 10Ema) [09:18:21] 10SRE, 10observability: Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10MoritzMuehlenhoff) >>! In T288028#7258214, @jcrespo wrote: > I dug deeper and the causes seems to be a 2017 incident mentioned on the meeting notes as: > >> screen "api-hhvm-restarts" on n... 
[09:18:38] (03PS2) 10H.krishna123: web_app: Created skeleton code for frontend, with new amendments to api_db and static files [software/bernard] - 10https://gerrit.wikimedia.org/r/703490 (https://phabricator.wikimedia.org/T285438) [09:20:08] (03CR) 10Muehlenhoff: [C: 03+2] Add testvm2001 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/709963 (owner: 10Muehlenhoff) [09:24:45] (03CR) 10H.krishna123: "Ready for review 😊" [software/bernard] - 10https://gerrit.wikimedia.org/r/703490 (https://phabricator.wikimedia.org/T285438) (owner: 10H.krishna123) [09:25:18] (03PS1) 10Giuseppe Lavagetto: mwdebug: switch to using sockets for fcgi proxying [deployment-charts] - 10https://gerrit.wikimedia.org/r/709986 [09:25:45] (03PS1) 10Muehlenhoff: Add repo hook for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/709987 (https://phabricator.wikimedia.org/T287671) [09:26:30] (03CR) 10H.krishna123: web_app: Created skeleton code for frontend, with new amendments to api_db and static files (031 comment) [software/bernard] - 10https://gerrit.wikimedia.org/r/703490 (https://phabricator.wikimedia.org/T285438) (owner: 10H.krishna123) [09:26:36] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (10ayounsi) a:05ayounsi→03None [09:28:36] (03PS1) 10Phuedx: Use real transactions when creating an election [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709973 (https://phabricator.wikimedia.org/T287859) [09:28:56] (03PS1) 10Phuedx: Use real transactions when creating an election [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709974 (https://phabricator.wikimedia.org/T287859) [09:33:37] (03CR) 10MVernon: [V: 03+1] Packaging: add Depends on curl [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709949 (owner: 10MVernon) [09:35:17] (03CR) 10MVernon: [V: 03+1 C: 03+2] Packaging: add Depends on curl 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709949 (owner: 10MVernon) [09:35:30] (03CR) 10MVernon: [V: 03+2 C: 03+2] Packaging: add Depends on curl [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709949 (owner: 10MVernon) [09:37:08] (03CR) 10Btullis: [C: 03+2] Failback analytics-hive to the primary server [dns] - 10https://gerrit.wikimedia.org/r/709962 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [09:39:34] (03PS5) 10Giuseppe Lavagetto: mediawiki::website: parametrize the fcgi proxy in search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) [09:40:06] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::website: parametrize the fcgi proxy in search.w.o [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) (owner: 10Giuseppe Lavagetto) [09:43:01] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [09:43:18] (03CR) 10Legoktm: services_proxy: Add envoyproxy for shellbox-constraints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [09:44:15] (03PS2) 10Legoktm: services_proxy: Add envoyproxy for shellbox-constraints [puppet] - 10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) [09:44:57] (03CR) 10Lucas Werkmeister (WMDE): services_proxy: Add envoyproxy for shellbox-constraints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [09:45:47] (03PS12) 10Zoranzoki21: Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [09:46:25] (03CR) 10Legoktm: services_proxy: Add envoyproxy for shellbox-constraints (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [09:46:55] (03CR) 10jerkins-bot: [V: 04-1] Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [09:47:04] (03CR) 10JMeybohm: [C: 03+2] Revert "Create dragonfly user via systemd-sysusers" [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/709803 (owner: 10JMeybohm) [09:48:23] (03CR) 10Ladsgroup: services_proxy: Add envoyproxy for shellbox-constraints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [09:49:53] (03Merged) 10jenkins-bot: Revert "Create dragonfly user via systemd-sysusers" [debs/dragonfly] - 10https://gerrit.wikimedia.org/r/709803 (owner: 10JMeybohm) [09:55:29] (03CR) 10Muehlenhoff: [C: 03+2] Add repo hook for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/709987 (https://phabricator.wikimedia.org/T287671) (owner: 10Muehlenhoff) [09:59:28] (03PS1) 10Zabe: Add happysrv.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709990 (https://phabricator.wikimedia.org/T288039) [10:01:00] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Bawolff) So i missed originally that the ordermethod was set to "created" (aka page_id) o... [10:05:52] 10SRE, 10SRE Observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) @jcrespo reported issues with minio scraping by prometheus, and indeed Prometheus' TLS certs validation changed due to a [[ https:/... 
[10:07:11] (03PS1) 10Martaannaj: Add config for the updated PropertySuggester for test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709991 (https://phabricator.wikimedia.org/T285098) [10:07:20] 10SRE, 10SRE Observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (10jcrespo) Since around 8-8:40 UTC, minio scrapping is failing on all backup* hosts, with: ` Aug 04 08:08:57 backup1004 minio[2621]: http: TLS ha... [10:08:44] (03CR) 10jerkins-bot: [V: 04-1] Add config for the updated PropertySuggester for test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709991 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [10:10:19] 10SRE, 10SRE Observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (10jcrespo) FYI, I am (re)using, perhaps incorrectly, the automatically generated host puppet certs for this (minio)- in case someone else is doing... 
[10:11:26] 10SRE, 10Data-Persistence-Backup, 10SRE Observability, 10media-backups, and 2 others: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (10jcrespo) [10:17:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317 T286888', diff saved to https://phabricator.wikimedia.org/P16955 and previous config saved to /var/cache/conftool/dbconfig/20210804-101719-marostegui.json [10:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:29] T286888: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 [10:19:17] (03CR) 10Hashar: [C: 03+1] "Lets go for it" [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) (owner: 10Jbond) [10:19:19] (03CR) 10Ladsgroup: Add config for the updated PropertySuggester for test Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709991 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [10:25:42] (03PS1) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/709975 [10:25:57] (03Abandoned) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/706038 (owner: 10Hashar) [10:29:40] !log importing dragonfly 1.0.6-1 (downgrade from 1.0.6-2) to buster-wikimedia and stretch-wikimedia - T286054 [10:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:47] T286054: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 [10:32:21] (03PS1) 10Elukey: apt-repo: update key for the ROCm repositories [puppet] - 10https://gerrit.wikimedia.org/r/709992 [10:34:12] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) All set for 
tomorrow's failover! [10:38:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/709992 (owner: 10Elukey) [10:38:19] (03PS2) 10Zabe: Add *.happysrv.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709990 (https://phabricator.wikimedia.org/T288039) [10:40:32] (03PS1) 10Hashar: Review access change [software/mailman-templates] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709976 [10:40:58] (03PS2) 10Hashar: Review access change [software/mailman-templates] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/709976 (https://phabricator.wikimedia.org/T288027) [10:44:33] (03CR) 10JMeybohm: [C: 03+2] site: Switch a bunch of eqiad appservers to appserver_dragonfly role [puppet] - 10https://gerrit.wikimedia.org/r/709740 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [10:48:44] !log switch most eqiad appservers to appserver_dragonfly role for testing - T286054 [10:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:52] T286054: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 [10:52:11] RECOVERY - DPKG on mw1384 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:53:59] !log running puppet on eqiad appservers [10:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window deploy. Your patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T1100). [11:00:04] phuedx: A patch you scheduled for European mid-day backport window (your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] \o [11:00:12] o/ [11:00:16] phuedx: do you want to self-service? [11:00:50] urbanecm: Can do [11:00:57] go ahead then :) [11:01:01] I'm around if needed [11:01:20] urbanecm: It's the same patch applied to both branches. I'll test one on mwdebug and if it's OK, then I'll deploy both. Sound OK? [11:01:38] yup [11:02:21] (03CR) 10Phuedx: [C: 03+2] "BACKPORT!!1" [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709973 (https://phabricator.wikimedia.org/T287859) (owner: 10Phuedx) [11:02:28] (03CR) 10Phuedx: [C: 03+2] "BACKPORT!!1" [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709974 (https://phabricator.wikimedia.org/T287859) (owner: 10Phuedx) [11:08:20] (03Merged) 10jenkins-bot: Use real transactions when creating an election [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709973 (https://phabricator.wikimedia.org/T287859) (owner: 10Phuedx) [11:08:23] (03Merged) 10jenkins-bot: Use real transactions when creating an election [extensions/SecurePoll] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709974 (https://phabricator.wikimedia.org/T287859) (owner: 10Phuedx) [11:12:01] Pulling onto mwdebug2001 [11:12:27] Testing now [11:12:47] (03PS1) 10MVernon: icinga: add MVernon to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/710000 [11:17:06] I was able to create a test election with "For wikis" set to "All wikis" from votewiki with the patch applied. 
This LGTM [11:20:02] Syncing -wmf.16 [11:20:27] (03PS1) 10LSobanski: icinga::ircbot: Send database notifications to #wikimedia-data-persistence [puppet] - 10https://gerrit.wikimedia.org/r/710002 (https://phabricator.wikimedia.org/T283580) [11:21:26] !log phuedx@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/SecurePoll: Backport: [[gerrit:709973|Use real transactions when creating an election]] (duration: 01m 19s) [11:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:42] Sycning -wmf.17 [11:22:49] *Syncing [11:23:45] (03CR) 10LSobanski: "Could you take a look? I believe you created the original config. I don't think anything else needs changing but I may be missing somethin" [puppet] - 10https://gerrit.wikimedia.org/r/710002 (https://phabricator.wikimedia.org/T283580) (owner: 10LSobanski) [11:24:26] !log phuedx@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/SecurePoll: Backport: [[gerrit:709974|Use real transactions when creating an election]] (duration: 01m 08s) [11:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:00] Done. urbanecm: It looks like the queue is empty [11:25:22] yup, you were the only customer :) [11:25:52] Cool. Thanks, as always, urbanecm [11:25:55] <3 [11:25:56] any time [11:26:51] as heads up: I'll be running docker pull tests in eqiad again. 
This time from 73 parallel servers max (excluding the ones causing sessionstore alerts this time ;-)) [11:31:45] (03PS1) 10Marostegui: Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/709978 [11:32:25] (03CR) 10Marostegui: [C: 03+2] Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/709978 (owner: 10Marostegui) [11:36:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1170:3317 and db1101:3317 T286888!', diff saved to https://phabricator.wikimedia.org/P16957 and previous config saved to /var/cache/conftool/dbconfig/20210804-113623-marostegui.json [11:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:31] T286888: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 [11:36:57] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Marostegui) 05Open→03Resolved db1170:3317 recloned, gtid enabled, notifications enabled, host pooled. 
[11:37:02] (03PS3) 10Dzahn: site/conftool: convert 4 jobrunners to appservers for balance [puppet] - 10https://gerrit.wikimedia.org/r/709639 [11:41:28] (03CR) 10Dzahn: [C: 03+2] "ACK, per https://review.gerrithub.io/plugins/replication/Documentation/config.md default is 3s and can't be lower" [puppet] - 10https://gerrit.wikimedia.org/r/709767 (owner: 10Hashar) [11:42:16] (03PS1) 10Filippo Giunchedi: alerts: filter deployable files by 'deploy-tag' [puppet] - 10https://gerrit.wikimedia.org/r/710007 (https://phabricator.wikimedia.org/T284810) [11:42:18] (03PS1) 10Filippo Giunchedi: alerts: refactor into ::prometheus [puppet] - 10https://gerrit.wikimedia.org/r/710008 (https://phabricator.wikimedia.org/T284810) [11:42:21] (03PS1) 10Filippo Giunchedi: alerts: add Thanos-specific alerts deploy [puppet] - 10https://gerrit.wikimedia.org/r/710009 (https://phabricator.wikimedia.org/T284810) [11:42:24] (03PS1) 10Filippo Giunchedi: thanos: add /srv/alerts-thanos to rule alerts path [puppet] - 10https://gerrit.wikimedia.org/r/710010 (https://phabricator.wikimedia.org/T284810) [11:42:55] (03CR) 10jerkins-bot: [V: 04-1] alerts: filter deployable files by 'deploy-tag' [puppet] - 10https://gerrit.wikimedia.org/r/710007 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [11:43:29] !log installing testvm2001 T286206 [11:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:37] T286206: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 [11:43:46] (03PS4) 10Dzahn: site/conftool: convert 4 jobrunners to appservers for balance [puppet] - 10https://gerrit.wikimedia.org/r/709639 [11:46:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:28] (03PS1) 10Reedy: Add export-0.11 to xml/index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710011 [11:49:34] jouncebot: now [11:49:34] For the next 0 hour(s) and 10 minute(s): European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T1100) [11:49:58] (03PS2) 10Reedy: Add export-0.11 to xml/index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710011 (https://phabricator.wikimedia.org/T288040) [11:50:19] (03PS2) 10Filippo Giunchedi: alerts: filter deployable files by 'deploy-tag' [puppet] - 10https://gerrit.wikimedia.org/r/710007 (https://phabricator.wikimedia.org/T284810) [11:50:21] (03PS2) 10Filippo Giunchedi: alerts: refactor into ::prometheus [puppet] - 10https://gerrit.wikimedia.org/r/710008 (https://phabricator.wikimedia.org/T284810) [11:50:23] (03PS2) 10Filippo Giunchedi: alerts: add Thanos-specific alerts deploy [puppet] - 10https://gerrit.wikimedia.org/r/710009 (https://phabricator.wikimedia.org/T284810) [11:50:25] (03PS2) 10Filippo Giunchedi: thanos: add /srv/alerts-thanos to rule alerts path [puppet] - 10https://gerrit.wikimedia.org/r/710010 (https://phabricator.wikimedia.org/T284810) [11:50:27] (03CR) 10Reedy: [C: 03+2] Add export-0.11 to xml/index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710011 (https://phabricator.wikimedia.org/T288040) (owner: 10Reedy) [11:51:13] (03CR) 10Jelto: [C: 04-1] site/conftool: convert 4 jobrunners to appservers for balance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709639 (owner: 10Dzahn) [11:52:06] (03Merged) 10jenkins-bot: Add export-0.11 to xml/index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710011 (https://phabricator.wikimedia.org/T288040) (owner: 10Reedy) [11:53:47] !log reedy@deploy1002 Synchronized 
docroot/mediawiki.org/xml/index.html: T288040 (duration: 01m 08s) [11:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:55] T288040: Missing Link at https://www.mediawiki.org/xml/ for entry export-0.11 - https://phabricator.wikimedia.org/T288040 [11:55:01] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:55:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) @Jclark-ctr We already have quite a few more servers in rack D than in A, B or C, with A having the smallest number. Would it be possible to put all of these... [11:55:33] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:59:47] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:59:58] (03CR) 10Dzahn: site/conftool: convert 4 jobrunners to appservers for balance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709639 (owner: 10Dzahn) [12:00:19] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:01:59] (03PS1) 10Hnowlan: profile::maps::osm_replica: Allow replicas to be connected to by tegola [puppet] - 10https://gerrit.wikimedia.org/r/710013 (https://phabricator.wikimedia.org/T283159) [12:02:25] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:41] (03PS5) 10Dzahn: site/conftool: convert 4 jobrunners to appservers for balance [puppet] - 10https://gerrit.wikimedia.org/r/709639 [12:03:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:41] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:05] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30477/console" [puppet] - 10https://gerrit.wikimedia.org/r/710013 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [12:05:17] (03PS6) 10Dzahn: site/conftool: convert 4 jobrunners to appservers for balance [puppet] - 10https://gerrit.wikimedia.org/r/709639 [12:05:19] (03CR) 10Jelto: [C: 03+1] "lgtm now" [puppet] - 10https://gerrit.wikimedia.org/r/709639 (owner: 10Dzahn) [12:07:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1129', diff saved to https://phabricator.wikimedia.org/P16958 and previous config saved to /var/cache/conftool/dbconfig/20210804-120725-marostegui.json [12:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:14] (03CR) 10Filippo Giunchedi: "Thank you! 
LGTM overall, see inline" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709671 (owner: 10David Caro) [12:16:07] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [12:16:57] (03PS1) 10Muehlenhoff: Add testvm200[12] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/710017 (https://phabricator.wikimedia.org/T286206) [12:17:49] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [12:18:27] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:17] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:16] !log depool prometheus2004 for upgrade - T222113 [12:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:23] T222113: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 [12:23:55] (03PS2) 10MMandere: configmaster: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/709668 (https://phabricator.wikimedia.org/T282787) [12:25:47] (03CR) 10MMandere: configmaster: Add drmrs DC site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709668 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:27:46] (03CR) 10Ladsgroup: [C: 03+1] "It has my virtual blessing" [puppet] - 10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [12:33:35] (03PS3) 10Ladsgroup: Add shellbox-constraint services and use them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) [12:33:39] (03CR) 10Ladsgroup: Add shellbox-constraint services and use them (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [12:35:15] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:07] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:56] (03PS2) 10Muehlenhoff: Add testvm200[12] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/710017 (https://phabricator.wikimedia.org/T286206) [12:39:18] (03CR) 
10Giuseppe Lavagetto: [C: 04-1] "No calls to services in the cloud env are allowed from production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709991 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [12:39:41] (03CR) 10Muehlenhoff: [C: 03+2] Add testvm200[12] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/710017 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [12:44:44] (03CR) 10Kormat: [C: 03+1] icinga: add MVernon to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/710000 (owner: 10MVernon) [12:44:53] (03CR) 10Elukey: [C: 03+2] apt-repo: update key for the ROCm repositories [puppet] - 10https://gerrit.wikimedia.org/r/709992 (owner: 10Elukey) [12:47:11] (03CR) 10MVernon: [C: 03+2] icinga: add MVernon to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/710000 (owner: 10MVernon) [12:47:53] (03PS1) 10Urbanecm: Initial configuration for jvwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710025 (https://phabricator.wikimedia.org/T286241) [12:48:09] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/709668 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:50:31] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [12:55:00] (03CR) 10Labdajiwa: Initial configuration for jvwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710025 (https://phabricator.wikimedia.org/T286241) (owner: 10Urbanecm) [12:55:41] (03PS2) 10Urbanecm: Initial configuration for jvwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710025 (https://phabricator.wikimedia.org/T286241) [12:55:59] (03CR) 10Urbanecm: Initial configuration for jvwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710025 (https://phabricator.wikimedia.org/T286241) (owner: 10Urbanecm) [12:56:33] (03CR) 10Kormat: [C: 03+1] icinga::ircbot: Send 
database notifications to #wikimedia-data-persistence (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710002 (https://phabricator.wikimedia.org/T283580) (owner: 10LSobanski) [12:57:54] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [13:00:05] Urbanecm and Amir1: Dear deployers, time to do the Create jvwikisource deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T1300). [13:00:08] o/ [13:00:13] o/ [13:00:21] starting then :) [13:00:27] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for jvwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710025 (https://phabricator.wikimedia.org/T286241) (owner: 10Urbanecm) [13:01:13] (03Merged) 10jenkins-bot: Initial configuration for jvwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710025 (https://phabricator.wikimedia.org/T286241) (owner: 10Urbanecm) [13:03:46] addWiki.php works like a charm this time [13:04:27] DB created in correct section, syncing [13:05:45] !log urbanecm@deploy1002 Synchronized wmf-config/db-eqiad.php: Creating jvwikisource (T286241) (duration: 01m 08s) [13:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:54] T286241: Create Wikisource Javanese - https://phabricator.wikimedia.org/T286241 [13:06:04] (03CR) 10JMeybohm: [C: 04-1] helpers: do not repeat ports section for kafka brokers egress rules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/709712 (owner: 10DCausse) [13:07:01] !log urbanecm@deploy1002 Synchronized wmf-config/db-codfw.php: Creating jvwikisource (T286241) (duration: 01m 07s) [13:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:15] (03PS1) 10Lucas Werkmeister (WMDE): Rephrase Bot under the Fountain message [wikimedia/bots/jouncebot] - 
10https://gerrit.wikimedia.org/r/710028 [13:07:22] (03PS6) 10Giuseppe Lavagetto: mediawiki::website: parametrize the fcgi proxy in all sites [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) [13:07:30] (03CR) 10Lucas Werkmeister (WMDE): ":)" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/710028 (owner: 10Lucas Werkmeister (WMDE)) [13:07:52] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::website: parametrize the fcgi proxy in all sites [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) (owner: 10Giuseppe Lavagetto) [13:08:10] !log urbanecm@deploy1002 Synchronized dblists: Creating jvwikisource (T286241) (duration: 01m 07s) [13:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:42] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating jvwikisource (T286241) [13:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:49] (03PS7) 10Giuseppe Lavagetto: mediawiki::website: parametrize the fcgi proxy in all sites [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) [13:10:56] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating jvwikisource (T286241) (duration: 01m 07s) [13:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:03] T286241: Create Wikisource Javanese - https://phabricator.wikimedia.org/T286241 [13:12:01] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::website: parametrize the fcgi proxy in all sites [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) (owner: 10Giuseppe Lavagetto) [13:14:13] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating jvwikisource (T286241) (duration: 01m 06s) [13:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:23] !log urbanecm@deploy1002 
Synchronized wmf-config/InitialiseSettings.php: Creating jvwikisource (T286241) (duration: 01m 06s) [13:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:33] (03PS8) 10Giuseppe Lavagetto: mediawiki::website: parametrize the fcgi proxy in all sites [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) [13:15:44] looks like this is the only wiki we have for today [13:15:51] doing cache [13:16:00] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710033 [13:16:02] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710033 (owner: 10Urbanecm) [13:16:46] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30480/console" [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) (owner: 10Giuseppe Lavagetto) [13:17:53] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710033 (owner: 10Urbanecm) [13:18:11] !log upgraded python3-wmflib to v0.0.9 fleet wide [13:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:00] (03PS4) 10Volans: decorators: migrate to the wmflib version [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) [13:19:04] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 03m 11s) [13:19:09] (03CR) 10Volans: decorators: migrate to the wmflib version (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/704345 (https://phabricator.wikimedia.org/T257905) (owner: 10Volans) [13:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:19] this should be all, right Amir1 ? 
[13:19:32] yup [13:19:39] !log jvwikisource was created (T286241) [13:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:46] T286241: Create Wikisource Javanese - https://phabricator.wikimedia.org/T286241 [13:19:48] great. Thanks for the mental support :) [13:21:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:26] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::website: parametrize the fcgi proxy in all sites [puppet] - 10https://gerrit.wikimedia.org/r/708789 (https://phabricator.wikimedia.org/T285298) (owner: 10Giuseppe Lavagetto) [13:24:08] (03PS21) 10Elukey: Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [13:24:10] (03PS3) 10Elukey: Add kfserving basic helmfile config under admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/709494 (https://phabricator.wikimedia.org/T272919) [13:24:36] (03PS2) 10DCausse: helpers: generate proper yaml for kafka egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/709712 [13:24:45] (03CR) 10jerkins-bot: [V: 04-1] Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [13:24:54] (03CR) 10jerkins-bot: [V: 04-1] Add kfserving basic helmfile config under admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/709494 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [13:25:09] (03CR) 10DCausse: helpers: generate proper yaml for kafka egress rules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/709712 (owner: 10DCausse) [13:27:51] (03CR) 10Hnowlan: profile::maps::osm_replica: Allow replicas to be connected to by tegola [puppet] - 10https://gerrit.wikimedia.org/r/710013 
(https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [13:30:14] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:46] (03PS22) 10Elukey: Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [13:30:48] (03PS4) 10Elukey: Add kfserving basic helmfile config under admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/709494 (https://phabricator.wikimedia.org/T272919) [13:32:05] (03PS4) 10Volans: Class API: add rollback() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 [13:32:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:33] (03CR) 10Elukey: "To keep archives happy - I had a chat with Janis about the charts, and after reviewing https://github.com/kubeflow/kfserving/blob/master/h" [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [13:35:39] (03CR) 10Hnowlan: [V: 03+1] "Including postgres-adjacent people for visibility" [puppet] - 10https://gerrit.wikimedia.org/r/709717 (https://phabricator.wikimedia.org/T283159) (owner: 10Hnowlan) [13:36:18] (03PS1) 10Urbanecm: jvwikisource: Add author namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710037 (https://phabricator.wikimedia.org/T286241) [13:37:02] (03CR) 10JMeybohm: [C: 04-1] Add kubeflow's kfserving charts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [13:40:09] (03CR) 10JMeybohm: [C: 03+1] "This LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/709494 
(https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [13:41:31] (03CR) 10Addshore: Add config for the updated PropertySuggester for test Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709991 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [13:43:09] (03PS3) 10DCausse: helpers: generate proper yaml for kafka egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/709712 [13:43:11] (03PS1) 10DCausse: rdf-streaming-updater: Drop custom kafka egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/710042 [13:45:35] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/709712 (owner: 10DCausse) [13:45:48] (03CR) 10JMeybohm: [C: 03+1] rdf-streaming-updater: Drop custom kafka egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/710042 (owner: 10DCausse) [13:46:45] (03CR) 10Labdajiwa: [C: 03+1] "The translations are accurate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710037 (https://phabricator.wikimedia.org/T286241) (owner: 10Urbanecm) [13:46:57] jouncebot: now [13:46:57] For the next 0 hour(s) and 13 minute(s): Create jvwikisource (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T1300) [13:47:04] still my window, going to sync it [13:47:07] (03CR) 10Urbanecm: [C: 03+2] jvwikisource: Add author namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710037 (https://phabricator.wikimedia.org/T286241) (owner: 10Urbanecm) [13:47:46] 10SRE, 10Infrastructure-Foundations, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10cmooney) a:03cmooney [13:47:57] (03CR) 10Urbanecm: [C: 03+2] jvwikisource: Add author namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710037 (https://phabricator.wikimedia.org/T286241) (owner: 10Urbanecm) [13:48:09] (03Merged) 10jenkins-bot: jvwikisource: Add author namespace [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/710037 (https://phabricator.wikimedia.org/T286241) (owner: 10Urbanecm) [13:49:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:39] (03CR) 10DCausse: [C: 03+2] helpers: generate proper yaml for kafka egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/709712 (owner: 10DCausse) [13:50:50] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5d7255c1127f951da59b9b48749fe9cf59e11930: jvwikisource: Add author namespace (T286241) (duration: 01m 06s) [13:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:57] T286241: Create Wikisource Javanese - https://phabricator.wikimedia.org/T286241 [13:53:01] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: Drop custom kafka egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/710042 (owner: 10DCausse) [13:53:25] (03Merged) 10jenkins-bot: helpers: generate proper yaml for kafka egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/709712 (owner: 10DCausse) [13:55:39] (03Merged) 10jenkins-bot: rdf-streaming-updater: Drop custom kafka egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/710042 (owner: 10DCausse) [13:55:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:16] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[14:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: Disk failure for elastic1039.eqiad.wmnet - https://phabricator.wikimedia.org/T286497 (10Cmjohnson) 05Open→03Resolved That's great @RKemper I will resolve this task now. [14:08:52] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10Cmjohnson) @Kormat I can update f/w today, please take the server offline and I run the f/w updates. [14:11:20] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:23] kormat: Let me know if you want me to do that now? ^ [14:12:00] marostegui: ah, thanks. i got it [14:12:05] kormat: <3 [14:12:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:14] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:28] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 18 hosts with reason: Firmware upgrade on db1104 (s8 primary) T286226 [14:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:38] T286226: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 [14:13:42] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 18 hosts with reason: Firmware upgrade on db1104 (s8 primary) T286226 [14:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:58] !log imported gitlab-ce 13.12.9 to thirdparty/gitlab T287671 [14:17:06] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226 (10Kormat) @Cmjohnson: db1104 is now powered off, update at will. Thanks! [14:17:22] !log depool prometheus2004 and pool prometheus2003 - T222113 [14:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:30] T222113: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 [14:18:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:59] (03CR) 10Addshore: "Let's go discuss on https://phabricator.wikimedia.org/T285098" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709991 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [14:19:28] moritzm: your gitlab import !log had a space in front of it and wasn't counted [14:19:41] !log imported gitlab-ce 13.12.9 to thirdparty/gitlab T287671 [14:19:43] majavah: thx [14:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:26] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:22:39] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:25:27] (03CR) 10Kormat: [C: 03+1] production-m5.sql.erb: Add toolhub grants [puppet] - 10https://gerrit.wikimedia.org/r/709877 (https://phabricator.wikimedia.org/T271480) (owner: 10Marostegui) [14:28:46] !log upgrade prometheus on prometheus4001 - T222113 [14:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:54] T222113: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 [14:30:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/709668 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [14:30:40] !log upgrade prometheus on cloudmetrics hosts - T222113 [14:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:53] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10cmooney) @Papaul do you know what 
the status is with this device? I can confirm there are some characters visible via serial console / port 47... [14:36:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709693 (owner: 10MVernon) [14:41:05] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Andrew) Let me know when the drive shows up and I'll take that host out of service so you can power it down. [14:50:25] (03CR) 10Elukey: Add kubeflow's kfserving charts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [14:50:33] (03PS23) 10Elukey: Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [14:50:35] (03PS5) 10Elukey: Add kfserving basic helmfile config under admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/709494 (https://phabricator.wikimedia.org/T272919) [14:52:42] (03PS4) 10Hashar: multiversion: enhance buildDBList output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 [14:53:44] (03CR) 10Hashar: "I have misunderstood the source of the dblist files, thx Timo for the explanation. 
In PS4 I have dropped the part that change the files he" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [14:53:55] (03PS13) 10Juan90264: Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) [14:54:30] (03CR) 10Klausman: [C: 03+1] Add kfserving basic helmfile config under admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/709494 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [14:55:36] (03CR) 10jerkins-bot: [V: 04-1] Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [14:57:33] (03CR) 10Bstorm: [C: 03+2] metricsinfra: Add IRC bot for alerting [puppet] - 10https://gerrit.wikimedia.org/r/709514 (https://phabricator.wikimedia.org/T287148) (owner: 10Majavah) [14:58:03] (03CR) 10JMeybohm: [C: 03+1] Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [14:59:54] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:02:10] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:08:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709950 (owner: 10MVernon) [15:08:35] (03CR) 10Herron: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/30483/" [puppet] - 10https://gerrit.wikimedia.org/r/710008 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [15:10:30] (03CR) 10Herron: [C: 03+1] alerts: add Thanos-specific alerts deploy [puppet] - 10https://gerrit.wikimedia.org/r/710009 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [15:10:39] (03CR) 10Herron: [C: 03+1] thanos: add /srv/alerts-thanos to rule alerts path [puppet] - 10https://gerrit.wikimedia.org/r/710010 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [15:22:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (10dcaro) @cmooney hey, I acknowledge that tomorrow is a good time, ping me whenever you want to get it going :) [15:25:14] (03CR) 10Ahmon Dancy: [C: 03+1] multiversion: enhance buildDBList output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [15:29:53] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (10dcaro) Ack for tomorrow too (same as T288036) [15:36:08] (03PS1) 10Cwhite: pontoon: add pontoon logging environment [puppet] - 10https://gerrit.wikimedia.org/r/710056 [15:40:05] 10SRE, 10Release-Engineering-Team: releases1002 /srv/docker DISK SPACE alert - https://phabricator.wikimedia.org/T288024 (10thcipriani) >>! 
In T288024#7258016, @hashar wrote: > That is routinely filing up due to some Jenkins job creating images/containers but not reclaiming them at end of build. @dancy / @dduv... [15:41:32] (03PS1) 10Filippo Giunchedi: grafana: enforce minimum 30s dashboard refresh [puppet] - 10https://gerrit.wikimedia.org/r/710058 (https://phabricator.wikimedia.org/T119719) [15:41:38] 10SRE, 10Release-Engineering-Team: releases1002 /srv/docker DISK SPACE alert - https://phabricator.wikimedia.org/T288024 (10hashar) 05Open→03Resolved a:03hashar Great, and this task can be marked as resolved since immediate action have been taken earlier today to resolve the alarm. [15:47:39] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw235[1357].wmnet [15:47:45] (03CR) 10BryanDavis: production-m5.sql.erb: Add toolhub grants (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/709877 (https://phabricator.wikimedia.org/T271480) (owner: 10Marostegui) [15:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:27] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw235[1357].codfw.wmnet [15:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:45] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw237[789].codfw.wmnet [15:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:28] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2380.codfw.wmnet [15:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:24] (03PS7) 10Dzahn: site/conftool: convert 4 jobrunners to appservers for balance [puppet] - 10https://gerrit.wikimedia.org/r/709639 [15:54:36] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10Papaul) I as far as i know I removed the old faulty 
device, replaced it with this on, connected the console, power and network to the device, t... [15:54:49] (03CR) 10Dzahn: [C: 03+2] site/conftool: convert 4 jobrunners to appservers for balance [puppet] - 10https://gerrit.wikimedia.org/r/709639 (owner: 10Dzahn) [15:55:15] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10MoritzMuehlenhoff) The Ganeti test cluster has been set up, along with two test instances (testvm2001/2002). Next it will be used to test the Buster update. [15:55:36] (03PS1) 10Urbanecm: Fix array key handling for GEHelpPanelLinks in on-wiki config [extensions/GrowthExperiments] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709983 (https://phabricator.wikimedia.org/T288023) [15:56:04] (03PS1) 10Urbanecm: Fix array key handling for GEHelpPanelLinks in on-wiki config [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709984 (https://phabricator.wikimedia.org/T288023) [15:56:39] jouncebot: next [15:56:40] In 2 hour(s) and 3 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T1800) [15:56:40] In 2 hour(s) and 3 minute(s): Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T1800) [15:58:05] !log mw2351, mw2353, mw2355, mw2357 - converting from appserver to jobrunner, mw2377, mw2378, mw2379, mw2380 - converting from jobrunner to appserver - for balancing of server types over rows [15:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:54] (03CR) 10Bstorm: [C: 03+2] "They were removed during the Trusty to Stretch upgrade and that was a while ago. 
I'm willing to just say we don't support that anymore if " [puppet] - 10https://gerrit.wikimedia.org/r/703619 (owner: 10Majavah) [15:59:42] PROBLEM - PHP7 jobrunner on mw2378 is CRITICAL: connect to address 10.192.0.41 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Jobrunner [16:00:16] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10RLazarus) @ema That makes sense, thanks for the... [16:00:24] PROBLEM - mediawiki-installation DSH group on mw2357 is CRITICAL: Host mw2357 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:00:41] (03CR) 10Herron: [C: 03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/710058 (https://phabricator.wikimedia.org/T119719) (owner: 10Filippo Giunchedi) [16:00:50] PROBLEM - PHP7 rendering on mw2378 is CRITICAL: connect to address 10.192.0.41 and port 9005: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:01:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw[2377-2379].codfw.wmnet with reason: reimage [16:01:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw[2377-2379].codfw.wmnet with reason: reimage [16:01:17] (03CR) 10Cwhite: [C: 03+1] "Looks ok to me modulo other comments." 
[debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709671 (owner: 10David Caro) [16:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2380.codfw.wmnet with reason: reimage [16:01:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2380.codfw.wmnet with reason: reimage [16:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:02] PROBLEM - PHP7 rendering on mw2357 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 479 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:02:20] PROBLEM - Host db1104.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:02:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2357.codfw.wmnet with reason: reimage [16:02:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2357.codfw.wmnet with reason: reimage [16:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:52] RECOVERY - Host db1104.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [16:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:09] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/710007 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [16:04:31] 10SRE, 10Cassandra, 10RESTBase-Cassandra, 10Patch-For-Review, and 2 others: Configure a threshold for earlier notification of /srv/cassandra/instance-data - https://phabricator.wikimedia.org/T191659 
(10hnowlan) [16:05:42] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/710008 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [16:06:40] (03CR) 10Cwhite: [C: 03+1] alerts: add Thanos-specific alerts deploy [puppet] - 10https://gerrit.wikimedia.org/r/710009 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [16:07:01] (03CR) 10Cwhite: [C: 03+1] thanos: add /srv/alerts-thanos to rule alerts path [puppet] - 10https://gerrit.wikimedia.org/r/710010 (https://phabricator.wikimedia.org/T284810) (owner: 10Filippo Giunchedi) [16:09:13] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/710058 (https://phabricator.wikimedia.org/T119719) (owner: 10Filippo Giunchedi) [16:12:09] (03CR) 10Cwhite: [C: 03+1] logstash: add logstash103[345] to eqiad elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709731 (https://phabricator.wikimedia.org/T287938) (owner: 10Herron) [16:12:44] (03CR) 10Cwhite: [C: 03+1] logstash: add logstash203[345] to codfw elk cluster [puppet] - 10https://gerrit.wikimedia.org/r/709732 (https://phabricator.wikimedia.org/T287938) (owner: 10Herron) [16:12:48] (03CR) 10MVernon: [V: 03+2 C: 03+2] wmf-update-ssh-config: add option to skip systemd activation [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709950 (owner: 10MVernon) [16:13:10] (03CR) 10MVernon: [V: 03+2 C: 03+2] Correct documented path of wmf-update-ssh-config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/709693 (owner: 10MVernon) [16:13:30] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1008.eqiad.wmnet [16:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:40] !log draining maps1008 from cassandra cluster [16:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:11] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1008.eqiad.wmnet with 
reason: Rebuilding as buster replica of maps1009 [16:15:13] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1008.eqiad.wmnet with reason: Rebuilding as buster replica of maps1009 [16:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2351.codfw.wmnet with reason: REIMAGE [16:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2353.codfw.wmnet with reason: REIMAGE [16:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2351.codfw.wmnet with reason: REIMAGE [16:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:16] RECOVERY - PHP7 rendering on mw2357 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:20:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2355.codfw.wmnet with reason: REIMAGE [16:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:20] !log find . 
-type f -delete on /var/cache/nginx-docker-registry on registry2*, the disk is too small for unbound cache *and* accepting large uploads [16:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:28] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2353.codfw.wmnet with reason: REIMAGE [16:21:29] dancy: ^^ [16:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2353.codfw.wmnet with reason: reimage [16:21:45] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on mw2353.codfw.wmnet with reason: reimage [16:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2353.codfw.wmnet with reason: reimage [16:22:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2353.codfw.wmnet with reason: reimage [16:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:56] (03PS2) 10Cwhite: pontoon: add pontoon logging environment [puppet] - 10https://gerrit.wikimedia.org/r/710056 [16:22:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2357.codfw.wmnet with reason: reimage [16:22:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2357.codfw.wmnet with reason: reimage [16:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:34] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 
2:00:00 on mw2355.codfw.wmnet with reason: REIMAGE [16:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:01] (03CR) 10jerkins-bot: [V: 04-1] Fix array key handling for GEHelpPanelLinks in on-wiki config [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709984 (https://phabricator.wikimedia.org/T288023) (owner: 10Urbanecm) [16:25:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2355.codfw.wmnet with reason: reimage [16:25:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2355.codfw.wmnet with reason: reimage [16:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:28] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Majavah) [16:28:49] ACKNOWLEDGEMENT - PHP7 jobrunner on mw2353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimage https://wikitech.wikimedia.org/wiki/Jobrunner [16:29:10] (03PS1) 10Btullis: Create discrete log files per user notebook [puppet] - 10https://gerrit.wikimedia.org/r/710065 (https://phabricator.wikimedia.org/T287339) [16:31:11] (03CR) 10Jforrester: [C: 03+1] Rephrase Bot under the Fountain message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/710028 (owner: 10Lucas Werkmeister (WMDE)) [16:35:30] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:37:10] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:26] RECOVERY - PHP7 rendering on mw2378 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.152 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:03] (03CR) 10Luke081515: [C: 03+1] Add *.happysrv.de to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709990 (https://phabricator.wikimedia.org/T288039) (owner: 10Zabe) [16:40:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:21] (03CR) 10Elukey: [C: 03+2] Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [16:42:25] (03CR) 10Elukey: [C: 03+2] Add kfserving basic helmfile config under admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/709494 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [16:44:36] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Cmjohnson) [16:46:40] (03PS1) 10Urbanecm: updateMenteeData: Output how long the script took [extensions/GrowthExperiments] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709985 (https://phabricator.wikimedia.org/T287964) [16:47:23] (03CR) 10Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709984 (https://phabricator.wikimedia.org/T288023) (owner: 10Urbanecm) [16:49:37] jouncebot: now [16:49:37] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [16:49:41] (03CR) 10Urbanecm: [C: 03+2] updateMenteeData: Output how long the script took [extensions/GrowthExperiments] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709985 (https://phabricator.wikimedia.org/T287964) (owner: 10Urbanecm) [16:52:45] 10SRE, 10ops-eqiad, 10DBA: Upgrade db1104 firmware - 
https://phabricator.wikimedia.org/T286226 (10Cmjohnson) 05Open→03Resolved @kormat The f/w update has been completed. I am able to ssh into the host. Resolving the task; if you continue to have issues, please let me know. [16:55:28] !log mw2351, mw2353, mw2355 - scap pull [16:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:39] (03PS1) 10Majavah: metricsinfra: Add config management server [puppet] - 10https://gerrit.wikimedia.org/r/710068 (https://phabricator.wikimedia.org/T286299) [17:06:22] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron) Plan looks good to me! I'll suggest also spinning off a subtask or spreadsheet to...
[17:06:29] (03PS2) 10Krinkle: Move parsercache DB config to *Services.php (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 [17:06:33] (03PS2) 10Krinkle: Move parsercache DB config to *Services.php (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703630 [17:06:36] (03PS2) 10Krinkle: Move parsercache DB config to *Services.php (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703631 [17:07:41] (03CR) 10Majavah: [C: 04-1] Move parsercache DB config to *Services.php (2/3) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703630 (owner: 10Krinkle) [17:08:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2357.codfw.wmnet with reason: REIMAGE [17:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2377.codfw.wmnet with reason: REIMAGE [17:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:15] (03PS3) 10Krinkle: Move parsercache DB config to *Services.php (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703630 [17:09:21] (03PS3) 10Krinkle: Move parsercache DB config to *Services.php (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703631 [17:09:57] majavah: thanks :) [17:10:00] (03Merged) 10jenkins-bot: updateMenteeData: Output how long the script took [extensions/GrowthExperiments] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709985 (https://phabricator.wikimedia.org/T287964) (owner: 10Urbanecm) [17:10:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:10:39] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[17:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2357.codfw.wmnet with reason: REIMAGE [17:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2378.codfw.wmnet with reason: REIMAGE [17:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:38] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:16] (03CR) 10Krinkle: "This makes the output quite long and noisy. I don't understand why progress is useful here since the whole thing takes less than a second " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [17:12:17] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/GrowthExperiments/maintenance/updateMenteeData.php: 66c2c7593322dfc575edc818aaff8d9b79466bdd: updateMenteeData: Output how long the script took (T287964) (duration: 01m 07s) [17:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:25] T287964: updateMenteeData should say how long it took to generate the data - https://phabricator.wikimedia.org/T287964 [17:12:42] * urbanecm done [17:13:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2377.codfw.wmnet with reason: REIMAGE [17:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:53] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:28] !log dzahn@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2378.codfw.wmnet with reason: REIMAGE [17:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:03] (03PS5) 10Krinkle: multiversion: enhance buildDBList output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [17:20:30] (03PS3) 10BryanDavis: toolhub: initial chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) [17:21:51] (03PS6) 10Krinkle: multiversion: enhance buildDBList output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [17:22:58] (03CR) 10jerkins-bot: [V: 04-1] multiversion: enhance buildDBList output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [17:24:16] (03PS7) 10Krinkle: multiversion: enhance buildDBList output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [17:24:25] (03CR) 10Krinkle: [C: 03+1] "Proposed a slightly more minimal version." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [17:25:00] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw2351.codfw.wmnet [17:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:11] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw2353.codfw.wmnet [17:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:18] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw2355.codfw.wmnet [17:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:32] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2351.codfw.wmnet [17:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2353.codfw.wmnet [17:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:20] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2355.codfw.wmnet [17:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:47] (03CR) 10Hashar: multiversion: enhance buildDBList output (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [17:29:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:28] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw238[1-2].codfw.wmnet [17:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:35] (03CR) 10Hashar: [C: 04-1] "So Timo pointed out that this script runs on CI in ~ 5 seconds and should be fast overhaul. 
The mystery is why it takes 4 minutes on my l" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [17:39:05] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:06] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw2357.codfw.wmnet [17:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:33] !log mw2357, mw2377, mw2378 - scap pull [17:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:51] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:15] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw2357.codfw.wmnet [17:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:05] (03PS1) 10Bartosz Dziewoński: DiscussionTools: Make 'sourcemodetoolbar' available everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710075 (https://phabricator.wikimedia.org/T287927) [17:45:12] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2357.codfw.wmnet [17:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:32] (03CR) 10Ottomata: [C: 03+1] "I was going to suggest putting logs in user homedirs, but that makes configuring the logrotate rule much harder." 
[puppet] - 10https://gerrit.wikimedia.org/r/710065 (https://phabricator.wikimedia.org/T287339) (owner: 10Btullis) [17:46:23] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw237[7-9].codfw.wmnet [17:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:37] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw2380.codfw.wmnet [17:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:13] (03CR) 10Ottomata: [C: 03+1] Create discrete log files per user notebook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710065 (https://phabricator.wikimedia.org/T287339) (owner: 10Btullis) [17:49:15] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2377.codfw.wmnet [17:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:51] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2378.codfw.wmnet [17:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:09] (03PS1) 10Bartosz Dziewoński: DiscussionTools: Make 'newtopictool' available to everyone on arwiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710076 (https://phabricator.wikimedia.org/T285724) [17:53:43] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:49] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10dancy) [17:53:59] 10SRE, 10MW-on-K8s, 10serviceops: GitInfo is missing from mwdebug-kubernetes deployment - https://phabricator.wikimedia.org/T287512 (10dancy) 05Open→03Resolved This is fixed now. 
You can confirm by testing with this image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-08-04-173113-... [17:55:17] (03CR) 10BryanDavis: toolhub: initial chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [17:55:31] (03PS3) 10Ottomata: Enable canary events by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709531 (https://phabricator.wikimedia.org/T287789) [17:55:42] (03PS4) 10Ottomata: Enable canary events by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709531 (https://phabricator.wikimedia.org/T287789) [17:56:53] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2379.codfw.wmnet with reason: REIMAGE [17:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:29] (03CR) 10Bartosz Dziewoński: [C: 04-1] "Not to be deployed yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710076 (https://phabricator.wikimedia.org/T285724) (owner: 10Bartosz Dziewoński) [17:59:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2380.codfw.wmnet with reason: REIMAGE [17:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:49] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2379.codfw.wmnet with reason: REIMAGE [17:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] dduvall and twentyafterfour: Dear deployers, time to do the Train log triage with CPT deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T1800).
[18:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T1800). [18:00:05] Urbanecm: A patch you scheduled for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:24] I'll self-service [18:00:34] (03CR) 10Urbanecm: [C: 03+2] Fix array key handling for GEHelpPanelLinks in on-wiki config [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709984 (https://phabricator.wikimedia.org/T288023) (owner: 10Urbanecm) [18:00:38] (03CR) 10Urbanecm: [C: 03+2] Fix array key handling for GEHelpPanelLinks in on-wiki config [extensions/GrowthExperiments] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709983 (https://phabricator.wikimedia.org/T288023) (owner: 10Urbanecm) [18:00:54] RECOVERY - mediawiki-installation DSH group on mw2357 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:01:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mw2379.codfw.wmnet with reason: reimage [18:01:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw2379.codfw.wmnet with reason: reimage [18:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mw2380.codfw.wmnet with reason: reimage [18:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:33] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on mw2380.codfw.wmnet with reason: reimage [18:01:37] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:58] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2380.codfw.wmnet with reason: REIMAGE [18:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:59] (03Abandoned) 10Ahmon Dancy: Generate mediawiki-multiversion-debug image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708559 (https://phabricator.wikimedia.org/T287495) (owner: 10Ahmon Dancy) [18:03:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mw2380.codfw.wmnet with reason: reimage [18:03:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw2380.codfw.wmnet with reason: reimage [18:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:47] urbanecm: if you are clear, i'd like to deploy a config change [18:05:54] ottomata: go ahead and ping me when done please [18:06:00] * urbanecm is waiting for CI [18:06:08] (03CR) 10Ottomata: [C: 03+2] Enable canary events by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709531 (https://phabricator.wikimedia.org/T287789) (owner: 10Ottomata) [18:10:25] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable canary events by default - T287789 (duration: 01m 06s) [18:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:34] T287789: Enable canary events for streams by default - https://phabricator.wikimedia.org/T287789 [18:11:11] !log gitlab2001: upgrading to 13.12.9 [18:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:49] am done urbanecm thank you [18:11:54] great, thanks [18:16:58] !log gitlab1001: upgrading to 13.12.9 [18:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:23] (03Merged) 10jenkins-bot: Fix array key handling for GEHelpPanelLinks in on-wiki config [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709984 (https://phabricator.wikimedia.org/T288023) (owner: 10Urbanecm) [18:20:25] (03CR) 10jerkins-bot: [V: 04-1] Fix array key handling for GEHelpPanelLinks in on-wiki config [extensions/GrowthExperiments] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709983 (https://phabricator.wikimedia.org/T288023) (owner: 10Urbanecm) [18:20:30] meh [18:20:42] (03CR) 10Urbanecm: Fix array key handling for GEHelpPanelLinks in on-wiki config [extensions/GrowthExperiments] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709983 (https://phabricator.wikimedia.org/T288023) (owner: 10Urbanecm) [18:20:46] (03CR) 10Urbanecm: [C: 03+2] Fix array key handling for GEHelpPanelLinks in on-wiki config [extensions/GrowthExperiments] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709983 (https://phabricator.wikimedia.org/T288023) (owner: 10Urbanecm) [18:20:47] PROBLEM - DPKG on gitlab1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:25:00] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:24] !log mw2379, mw2380 - scap pull [18:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:16] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:20] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:57] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/GrowthExperiments/: 36a2b9f58148dad5434daa6d03b77f4c8b839314: Fix array key handling for GEHelpPanelLinks in on-wiki config (T288023) (duration: 01m 06s) [18:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:05] T288023: GrowthExperiments: GEHelpPanelLinks causing "PHP Notice: Undefined offset: 0" - https://phabricator.wikimedia.org/T288023 [18:33:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2379.codfw.wmnet [18:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:09] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2380.codfw.wmnet [18:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:07] (03Merged) 10jenkins-bot: Fix array key handling for GEHelpPanelLinks in on-wiki config [extensions/GrowthExperiments] (wmf/1.37.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709983 (https://phabricator.wikimedia.org/T288023) (owner: 10Urbanecm) [18:38:16] finally [18:38:43] pulls again [18:39:35] mutante: I'm going to scap sync-file it if that helps [18:39:44] is done converting jobrunners <-> app servers [18:39:54] urbanecm: should be good now, thank you! 
[18:40:01] just 2379 and 2380 [18:40:11] the others were pooled before you started [18:40:28] great [18:41:23] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.17/extensions/GrowthExperiments/: 5c3ac582335265287369e2d06332645ddbcba412: Fix array key handling for GEHelpPanelLinks in on-wiki config (T288023) (duration: 01m 08s) [18:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:31] T288023: GrowthExperiments: GEHelpPanelLinks causing "PHP Notice: Undefined offset: 0" - https://phabricator.wikimedia.org/T288023 [18:41:31] * urbanecm done [18:42:34] :) urbanecm: it was about balancing server types a little bit better across rows.. so if a row happens to go offline we should survive [18:42:41] laters [18:42:59] i see. Thanks :) [18:43:28] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:00] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:14] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops, and 2 others: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (10dancy) 05Open→03Resolved This is done. Now whenever a docker-registry.discovery.wmnet/restricted/medi... 
[18:49:20] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10dancy) [18:56:24] Just a note that I ran the generateTestElection.php SecurePoll maintenance script to create a test election on the Beta Cluster a moment ago. It was only afterwards that I remembered that I'd disconnected from IRC before my lunch [19:00:05] dduvall and twentyafterfour: May I have your attention please! MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T1900) [19:05:46] twentyafterfour: o/ [19:06:30] PROBLEM - Host cloudvirt1038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:06:44] dduvall: I'm around if I'm needed. [19:09:06] (03PS1) 10Dduvall: group1 wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710080 [19:09:08] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710080 (owner: 10Dduvall) [19:09:47] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710080 (owner: 10Dduvall) [19:11:17] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.17 [19:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:24] RECOVERY - DPKG on gitlab1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:12:33] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.17 (duration: 01m 15s) [19:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:28] !log 1.37.0-wmf.17 promoted to group1. 
no new errors or troubling error rates spotted (T281158) [19:22:31] * dduvall done [19:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:36] T281158: 1.37.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T281158 [19:24:12] 10SRE, 10observability: Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10herron) +1 to removing the check. We also have since enabled shell TMOUT which helps clean up cases where shells are left idle. Currently that's a 5 day timeout. [19:24:36] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:06] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Jclark-ctr) zoom meeting with Dell Tech support provided updated tsr report. system log shows error since last date of service but no current error. running hardware test... 
[19:35:54] RECOVERY - Host cloudvirt1038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms [19:38:08] (03PS1) 10Brennen Bearnes: gitlab / idp: open gitlab access to all users [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) [19:40:55] (03Abandoned) 10Hashar: multiversion: enhance buildDBList output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689673 (owner: 10Hashar) [19:40:58] (03CR) 10Majavah: "Is it intentional that this one does not modify required groups of the gitlab replica service below?" [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [19:42:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Jclark-ctr) @dzahn we have no more spaces in row A for any host [19:43:12] (03CR) 10Brennen Bearnes: gitlab / idp: open gitlab access to all users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [19:50:26] 10SRE, 10MW-on-K8s, 10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and profile work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10Krinkle) [19:50:35] 10SRE, 10MW-on-K8s, 10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10Krinkle) [19:51:04] 10SRE, 10MW-on-K8s, 10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10Krinkle) [19:51:14] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops, and 2 others: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (10Krinkle) [19:51:18] 10SRE, 10MW-on-K8s, 
10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10Krinkle) a:03dpifke [20:00:05] dduvall and twentyafterfour: (Dis)respected human, time to deploy MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T1900). Please do the needful. [20:00:05] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T2000). [20:08:00] (03PS1) 10Legoktm: mwdebug: Add shellbox-constraints envoyproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/710109 (https://phabricator.wikimedia.org/T285104) [20:08:09] (03PS3) 10Legoktm: services_proxy: Add envoyproxy for shellbox-constraints [puppet] - 10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) [20:10:16] (03CR) 10Legoktm: [C: 03+2] services_proxy: Add envoyproxy for shellbox-constraints [puppet] - 10https://gerrit.wikimedia.org/r/709960 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [20:12:45] (03CR) 10Legoktm: [C: 04-1] Add shellbox-constraint services and use them (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709821 (https://phabricator.wikimedia.org/T176312) (owner: 10Ladsgroup) [20:16:20] (03CR) 10Legoktm: [C: 03+2] mwdebug: Add shellbox-constraints envoyproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/710109 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [20:19:02] (03Merged) 10jenkins-bot: mwdebug: Add shellbox-constraints envoyproxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/710109 (https://phabricator.wikimedia.org/T285104) (owner: 10Legoktm) [20:21:33] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:54] (03PS1) 10Ottomata: eventgate - Disable http service if tls.enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/710111 (https://phabricator.wikimedia.org/T255871) [20:22:13] 10SRE, 10Analytics, 10Prod-Kubernetes, 10serviceops, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (10Ottomata) [20:22:31] 10SRE, 10Analytics, 10Analytics-Kanban, 10Prod-Kubernetes, and 3 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (10Ottomata) [20:23:42] 10SRE, 10Analytics, 10Analytics-Kanban, 10Prod-Kubernetes, and 3 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (10Ottomata) I think that will do it. helm template looks good locally. @JMeybohm is it ok that I moved the debug ports to their own Service? That'... [20:29:59] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Legoktm) [20:38:06] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:47:38] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10Jclark-ctr) @ayounsi @cmooney used different crimper and tested with cable tester shows good now on management cable. [20:48:28] 10SRE, 10SRE Observability (FY2021/2022-Q1): Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10lmata) [20:56:47] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:36] 10SRE, 10Datacenter-Switchover: switchdc check on mwmaint for running PHP processes should ignore php-fpm processes - https://phabricator.wikimedia.org/T285804 (10Legoktm) a:03Legoktm This is actually pretty easy [21:07:58] (03PS1) 10Legoktm: mediawiki: Don't emit "Stray php processes..." warning for php-fpm [software/spicerack] - 10https://gerrit.wikimedia.org/r/710114 (https://phabricator.wikimedia.org/T285804) [21:18:08] (03CR) 10RLazarus: mediawiki: Don't emit "Stray php processes..." warning for php-fpm (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/710114 (https://phabricator.wikimedia.org/T285804) (owner: 10Legoktm) [21:24:12] 10SRE, 10Datacenter-Switchover: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Legoktm) [21:24:16] 10SRE, 10DBA, 10Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Legoktm) [21:24:28] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (10Legoktm) [21:42:56] 10Puppet, 10Infrastructure-Foundations, 10Wikidata: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (10Legoktm) [21:44:39] (03PS2) 10Legoktm: mediawiki: Remove old 'parser_cache_purging' job [puppet] - 10https://gerrit.wikimedia.org/r/702427 (owner: 10Krinkle) [22:11:36] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@34cd541]: gerrit:709835 and 709836 [22:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:28] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@34cd541]: gerrit:709835 and 709836 (duration: 06m 52s) [22:18:33] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:02] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10Legoktm) >>! In T287983#7257682, @RLazarus wrote... [22:47:49] (03CR) 10Krinkle: [C: 03+1] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [22:52:22] 10SRE, 10DynamicPageList (Wikimedia), 10PoolCounter, 10serviceops, and 9 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10Legoktm) bawolff and I discussed this a bit more yesterday and think that something is probably going wrong with `nowait:` / nested locks. In theo... [22:55:12] (03PS1) 10RLazarus: icinga: Tweak --services API [puppet] - 10https://gerrit.wikimedia.org/r/710121 (https://phabricator.wikimedia.org/T285803) [22:59:38] (03CR) 10RLazarus: "Replying to your post-merge comments on https://gerrit.wikimedia.org/r/c/operations/puppet/+/708384." [puppet] - 10https://gerrit.wikimedia.org/r/710121 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210804T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:00:30] (03CR) 10Legoktm: [C: 03+2] mediawiki: Remove old 'parser_cache_purging' job [puppet] - 10https://gerrit.wikimedia.org/r/702427 (owner: 10Krinkle) [23:15:38] 10SRE, 10Services, 10Toolhub, 10serviceops, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) [23:17:46] (03CR) 10Thcipriani: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/710083 (https://phabricator.wikimedia.org/T288162) (owner: 10Brennen Bearnes) [23:19:03] 10SRE, 10Services, 10Toolhub, 10serviceops, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) I have edited the description to remove celery and redis from the initial deployment requirements. There would only be one celery job to run with th... [23:24:48] (03CR) 10Legoktm: mediawiki: Don't emit "Stray php processes..." warning for php-fpm (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/710114 (https://phabricator.wikimedia.org/T285804) (owner: 10Legoktm) [23:25:04] (03PS2) 10Legoktm: mediawiki: Ignore php-fpm when stopping cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/710114 (https://phabricator.wikimedia.org/T285804) [23:51:02] (03PS1) 10Ppchelko: Clean up temporary variable wgMathUseRestBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710126 (https://phabricator.wikimedia.org/T274436) [23:57:08] (03CR) 10RLazarus: [C: 03+1] "Thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/710114 (https://phabricator.wikimedia.org/T285804) (owner: 10Legoktm) [23:58:14] (03PS1) 10Legoktm: noc: Expose primary datacenter on conf/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 [23:59:22] (03CR) 10jerkins-bot: [V: 04-1] noc: Expose primary datacenter on conf/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710128 (owner: 10Legoktm)