[00:00:05] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:11] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:03] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:21:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:31] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [01:33:41] 10SRE, 10Wikimedia-Mailing-lists: East and South East Asia and Pacific (ESEAP) mailing list set up - https://phabricator.wikimedia.org/T316454 (10AmandaSLawrence) [01:33:54] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [01:36:29] 10SRE, 10Wikimedia-Mailing-lists: East and South East Asia and Pacific (ESEAP) mailing list set up - https://phabricator.wikimedia.org/T316454 (10AmandaSLawrence) Thanks, I've added an alternate email which will go to Alex Lum or whoever is the WMAU secretary. The list is for ESEAP discussions https://meta.wi... 
[01:36:45] (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:06:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:25:39] RECOVERY - Check systemd state on dse-k8s-worker1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:32:45] PROBLEM - Check systemd state on dse-k8s-worker1006 is 
CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2103 as master in dbctl T316481', diff saved to https://phabricator.wikimedia.org/P33562 and previous config saved to /var/cache/conftool/dbconfig/20220829-051020-marostegui.json [05:10:27] T316481: Mismatch between topology and replication in codfw s1 - https://phabricator.wikimedia.org/T316481 [05:12:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust weights on s1 T316481', diff saved to https://phabricator.wikimedia.org/P33563 and previous config saved to /var/cache/conftool/dbconfig/20220829-051206-marostegui.json [05:17:15] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [05:21:59] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [05:22:39] the latency graph for codfw is funky since 4:56 [05:24:17] it is usually around 275ms +/- 25 ms but the mean now occasionally bumps up to 450ms [05:24:59] anyway I am going to upgrade Gerrit from 3.4.4 to 3.45 [05:25:00] err [05:25:02] 3.4.5 [05:32:22] (03PS1) 10Hashar: Gerrit v3.4.5 and rebuild plugins [2] [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/827163 (https://phabricator.wikimedia.org/T315942) [05:33:01] (03CR) 10Hashar: [C: 03+2] Gerrit v3.4.5 and rebuild plugins [2] [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/827163 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [05:33:23] (03Merged) 10jenkins-bot: Gerrit v3.4.5 and rebuild plugins [2] [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/827163 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [05:36:56] !log hashar@deploy1002 Started deploy [gerrit/gerrit@f1a820b]: Gerrit to 3.4.5 on gerrit2002 [05:37:07] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@f1a820b]: Gerrit to 3.4.5 on gerrit2002 (duration: 00m 11s) [05:40:30] !log hashar@deploy1002 Started deploy [gerrit/gerrit@f1a820b]: Gerrit to 3.4.5 on gerrit1001 [05:40:39] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@f1a820b]: Gerrit to 3.4.5 on gerrit1001 (duration: 00m 09s) [05:42:33] * TheresNoTime was about to complain gerrit SSHing wasn't working.. [05:43:48] sorry it is a bit of a sneaky upgrade [05:44:02] !log Restarted Gerrit for 3.4.5 upgrade [05:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:32] TheresNoTime: it is back now! [05:44:46] so it is! 
:D [05:47:42] (03PS1) 10Marostegui: install_server: Do not reimage db1194 [puppet] - 10https://gerrit.wikimedia.org/r/827166 [05:49:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [05:49:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [05:49:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T316186)', diff saved to https://phabricator.wikimedia.org/P33564 and previous config saved to /var/cache/conftool/dbconfig/20220829-054939-ladsgroup.json [05:55:43] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10Joe) [05:55:52] 10SRE, 10serviceops: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313968 (10Joe) 05Open→03In progress p:05Triage→03Medium [05:55:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T316186)', diff saved to https://phabricator.wikimedia.org/P33565 and previous config saved to /var/cache/conftool/dbconfig/20220829-055554-ladsgroup.json [06:04:26] (03PS1) 10Giuseppe Lavagetto: role::memcached: simple memcached role [puppet] - 10https://gerrit.wikimedia.org/r/827167 (https://phabricator.wikimedia.org/T313968) [06:04:28] (03PS1) 10Giuseppe Lavagetto: mc-wf*: install memcached [puppet] - 10https://gerrit.wikimedia.org/r/827168 (https://phabricator.wikimedia.org/T313968) [06:05:01] (03CR) 10CI reject: [V: 04-1] role::memcached: simple memcached role [puppet] - 10https://gerrit.wikimedia.org/r/827167 (https://phabricator.wikimedia.org/T313968) (owner: 10Giuseppe Lavagetto) [06:06:26] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37013/console" [puppet] - 10https://gerrit.wikimedia.org/r/827168 
(https://phabricator.wikimedia.org/T313968) (owner: 10Giuseppe Lavagetto) [06:07:36] taking a break for kids breakfast [06:08:42] (03PS2) 10Giuseppe Lavagetto: role::memcached: simple memcached role [puppet] - 10https://gerrit.wikimedia.org/r/827167 (https://phabricator.wikimedia.org/T313968) [06:08:45] (03PS2) 10Giuseppe Lavagetto: mc-wf*: install memcached [puppet] - 10https://gerrit.wikimedia.org/r/827168 (https://phabricator.wikimedia.org/T313968) [06:10:27] 10SRE, 10Wikimedia-Mailing-lists: East and South East Asia and Pacific (ESEAP) mailing list set up - https://phabricator.wikimedia.org/T316454 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I created it: https://lists.wikimedia.org/postorius/lists/eseap.lists.wikimedia.org Note that I went with "Public... [06:11:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P33566 and previous config saved to /var/cache/conftool/dbconfig/20220829-061100-ladsgroup.json [06:13:29] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:15:04] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to old templatelinks fields in commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826773 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [06:16:03] (03Merged) 10jenkins-bot: Stop writing to old templatelinks fields in commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826773 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [06:18:03] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10George_Chernilevsky) '''File:Spöke.jpg''... 
[06:22:04] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:826773|Stop writing to old templatelinks fields in commons (T312865)]] (duration: 03m 43s) [06:22:09] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865 [06:24:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:24:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:24:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:25:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:26:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P33567 and previous config saved to /var/cache/conftool/dbconfig/20220829-062607-ladsgroup.json [06:35:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:35:54] 10SRE, 10Wikimedia-Mailing-lists: East and South East Asia and Pacific (ESEAP) mailing list set up - https://phabricator.wikimedia.org/T316454 (10AmandaSLawrence) Great thanks so much Viel. Very quick. I'll definitely add more owners and review the set up then invite people. 
Amanda [06:40:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:40:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:41:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T316186)', diff saved to https://phabricator.wikimedia.org/P33568 and previous config saved to /var/cache/conftool/dbconfig/20220829-064113-ladsgroup.json [06:41:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [06:41:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [06:41:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:41:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:41:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T316186)', diff saved to https://phabricator.wikimedia.org/P33569 and previous config saved to /var/cache/conftool/dbconfig/20220829-064154-ladsgroup.json [06:44:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:48:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] role::memcached: simple memcached role [puppet] - 10https://gerrit.wikimedia.org/r/827167 (https://phabricator.wikimedia.org/T313968) (owner: 10Giuseppe Lavagetto) [06:48:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T316186)', diff saved to https://phabricator.wikimedia.org/P33570 and previous config saved to /var/cache/conftool/dbconfig/20220829-064811-ladsgroup.json [06:49:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mc-wf*: install memcached 
[puppet] - 10https://gerrit.wikimedia.org/r/827168 (https://phabricator.wikimedia.org/T313968) (owner: 10Giuseppe Lavagetto) [07:00:04] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220829T0700) [07:00:04] Urbanecm: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:23] I’ll self-serve [07:02:02] (03PS3) 10Urbanecm: Revert "[beta] Temporarily allow everyone to enroll as mentor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826623 (https://phabricator.wikimedia.org/T310905) [07:02:06] (03CR) 10Urbanecm: [C: 03+2] Revert "[beta] Temporarily allow everyone to enroll as mentor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826623 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [07:02:42] (03PS2) 10Urbanecm: cswiki: fix extendedconfirmed permission for bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826955 [07:02:47] (03CR) 10Urbanecm: [C: 03+2] cswiki: fix extendedconfirmed permission for bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826955 (owner: 10Urbanecm) [07:02:50] (03Merged) 10jenkins-bot: Revert "[beta] Temporarily allow everyone to enroll as mentor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826623 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [07:03:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P33571 and previous config saved to /var/cache/conftool/dbconfig/20220829-070318-ladsgroup.json [07:03:33] (03Merged) 10jenkins-bot: cswiki: fix extendedconfirmed permission for bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826955 (owner: 10Urbanecm) [07:04:05] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage 
db1194 [puppet] - 10https://gerrit.wikimedia.org/r/827166 (owner: 10Marostegui) [07:05:03] (03PS1) 10KartikMistry: Enable SectionTranslation on 10 more WPs where ContentTranslation is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827174 (https://phabricator.wikimedia.org/T313300) [07:05:46] (03PS5) 10Slyngshede: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) [07:06:20] (03CR) 10Slyngshede: P:dbbackups::mydumper Move mydumper from cron to systemd timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:06:31] 10ops-codfw, 10DBA: db2149 is sad after reboot - https://phabricator.wikimedia.org/T316494 (10Marostegui) a:03Papaul It is definitely broken: ` root@db2149:~# dmesg -bash: /usr/bin/dmesg: Input/output error ` @Papaul can you see which disk is broken from your side? [07:06:38] (03PS1) 10Giuseppe Lavagetto: memcached: adapt values to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/827175 [07:07:09] (03PS3) 10Urbanecm: Revert "testwiki: Growth: Assign enrollasmentor to *" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826622 (https://phabricator.wikimedia.org/T310905) [07:07:14] (03CR) 10Urbanecm: [C: 03+2] Revert "testwiki: Growth: Assign enrollasmentor to *" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826622 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [07:07:16] (03PS1) 10Marostegui: db2149: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/827176 (https://phabricator.wikimedia.org/T316494) [07:08:03] (03CR) 10Marostegui: [C: 03+2] db2149: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/827176 (https://phabricator.wikimedia.org/T316494) (owner: 10Marostegui) [07:08:05] (03Merged) 10jenkins-bot: Revert "testwiki: Growth: Assign enrollasmentor to *" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/826622 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [07:09:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:10:04] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 20d62380d5e33931a3e6e4c5696a3cd179ff0eb1: cswiki: fix extendedconfirmed permission for bot group (duration: 03m 43s) [07:10:05] (03PS1) 10Marostegui: dbproxy1013,dbproxy1015: Add db1164 as standby [puppet] - 10https://gerrit.wikimedia.org/r/827177 (https://phabricator.wikimedia.org/T316202) [07:11:06] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37014/console" [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:11:11] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37015/console" [puppet] - 10https://gerrit.wikimedia.org/r/827175 (owner: 10Giuseppe Lavagetto) [07:11:47] (03PS2) 10Marostegui: dbproxy1013,dbproxy1015: Add db1164 as standby [puppet] - 10https://gerrit.wikimedia.org/r/827177 (https://phabricator.wikimedia.org/T316202) [07:12:30] (03CR) 10Marostegui: [C: 03+2] dbproxy1013,dbproxy1015: Add db1164 as standby [puppet] - 10https://gerrit.wikimedia.org/r/827177 (https://phabricator.wikimedia.org/T316202) (owner: 10Marostegui) [07:13:25] (03PS4) 10Urbanecm: GrowthExperiments: end mailing list campaign in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno) [07:13:28] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: end mailing list campaign in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno) [07:13:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [07:13:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:14:18] (03Merged) 10jenkins-bot: GrowthExperiments: end mailing list campaign in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno) [07:14:38] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 88b3ce8196927d46f13d05aa8f3467992832f09d: Revert "testwiki: Growth: Assign enrollasmentor to *" (T310905, T314414) (duration: 03m 32s) [07:14:43] T314414: Grant "enrollasmentor" right automatically according to community's specifications - https://phabricator.wikimedia.org/T314414 [07:14:43] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905 [07:14:54] * urbanecm done [07:15:03] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] memcached: adapt values to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/827175 (owner: 10Giuseppe Lavagetto) [07:16:04] (03PS1) 10Marostegui: mariadb: Promote db1164 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/827398 (https://phabricator.wikimedia.org/T316202) [07:17:26] (03CR) 10Jcrespo: [C: 04-1] "No comment has been added, or justification has been given why a comment shouldn't be added." 
[puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:17:33] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) [07:17:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2133,2160].codfw.wmnet,db[1117,1159,1164].eqiad.wmnet with reason: Switchover m2 T316202 [07:17:43] T316202: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 [07:17:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:17:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2133,2160].codfw.wmnet,db[1117,1159,1164].eqiad.wmnet with reason: Switchover m2 T316202 [07:17:57] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) [07:18:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P33572 and previous config saved to /var/cache/conftool/dbconfig/20220829-071824-ladsgroup.json [07:19:47] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote db1164 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/827398 (https://phabricator.wikimedia.org/T316202) (owner: 10Marostegui) [07:21:23] (03PS6) 10Slyngshede: P:dbbackups::mydumper Move mydumper from cron to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) [07:21:33] (03PS5) 10Giuseppe Lavagetto: Move 0.1% of user traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823675 (https://phabricator.wikimedia.org/T271736) [07:22:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:23:00] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37016/console" [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:23:46] <_joe_> urbanecm: is the backport window finished? [07:23:48] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) [07:23:54] _joe_: yes it is. [07:24:00] <_joe_> ack thanks [07:24:14] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) [07:25:24] (03CR) 10Slyngshede: [V: 03+1] P:dbbackups::mydumper Move mydumper from cron to systemd timer. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:25:26] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/827444 [07:25:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:25:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:26:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:29:00] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/827444 (owner: 10Marostegui) [07:30:56] !log Failover m3-master [07:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T316186)', diff saved to https://phabricator.wikimedia.org/P33573 and previous config saved to /var/cache/conftool/dbconfig/20220829-073330-ladsgroup.json [07:33:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [07:33:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [07:33:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33574 and previous config saved to /var/cache/conftool/dbconfig/20220829-073354-ladsgroup.json [07:35:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33575 and previous config saved to /var/cache/conftool/dbconfig/20220829-073516-ladsgroup.json [07:41:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33576 and previous 
config saved to /var/cache/conftool/dbconfig/20220829-074124-ladsgroup.json [07:45:32] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:16] (03PS1) 10Majavah: wikimediacloud.org: do not use CNAMEs for nsX addresses [dns] - 10https://gerrit.wikimedia.org/r/827446 [07:47:15] (03CR) 10CI reject: [V: 04-1] wikimediacloud.org: do not use CNAMEs for nsX addresses [dns] - 10https://gerrit.wikimedia.org/r/827446 (owner: 10Majavah) [07:48:02] (03PS2) 10Majavah: wikimediacloud.org: do not use CNAMEs for nsX addresses [dns] - 10https://gerrit.wikimedia.org/r/827446 [07:48:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:49:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:49:19] (03CR) 10Jcrespo: [C: 03+1] "It would be nice to copy that explanation on the commit message, so it is not empty, but in terms of functionality, this works for me." [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:50:15] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10taavi) >>! In T315955#8188684, @Andrew wrote: > This is good to know! I only recently changed those to CNAMEs, so I'll switch them back when... [07:52:04] (03PS7) 10Slyngshede: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) [07:52:40] (03CR) 10Slyngshede: P:dbbackups::mydumper Move mydumper from cron to systemd timer. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:55:08] (03PS8) 10Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:56:19] (03PS9) 10Jcrespo: P:dbbackups::mydumper Move mydumper from cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792113 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:56:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P33577 and previous config saved to /var/cache/conftool/dbconfig/20220829-075630-ladsgroup.json [07:56:51] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite) [07:56:55] (03CR) 10Ladsgroup: [C: 03+1] "wohoo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823675 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [07:58:05] (03CR) 10Vgutierrez: [C: 03+2] Increase roll-out of query-sorting to 50% [puppet] - 10https://gerrit.wikimedia.org/r/826994 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [07:58:23] !log Increase roll-out of query-sorting to 50% - T314868 [07:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:30] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [07:59:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Move 0.1% of user traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823675 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [08:00:58] (03CR) 10Ayounsi: [C: 03+1] "Looked at the code to make sure that the group is created as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/826869 (owner: 10Muehlenhoff) [08:02:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:03:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:03:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:03:12] (03CR) 10Ayounsi: rancid: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [08:03:42] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:04:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:05:41] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Moving 0.1% of users to php 7.4 (duration: 03m 52s) [08:06:16] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1164 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/827398 (https://phabricator.wikimedia.org/T316202) (owner: 10Marostegui) [08:06:47] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) [08:07:01] (03PS13) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [08:07:25] (03CR) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [08:07:59] (03PS2) 10KartikMistry: Enable SectionTranslation on 10 more WPs where ContentTranslation is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827174 (https://phabricator.wikimedia.org/T313300) [08:09:09] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:09:25] (03PS2) 10David Caro: tox: use the default python3 for the system [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/826782 [08:09:34] (03CR) 10David Caro: tox: use the default python3 for the system (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/826782 (owner: 10David Caro) [08:10:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:10:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:11:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:11:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P33578 and previous config saved to /var/cache/conftool/dbconfig/20220829-081136-ladsgroup.json [08:12:19] (03PS4) 10David Caro: wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) [08:12:31] (03CR) 10David Caro: wmcs.openstack.quota_increase: allow all known quota types (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: 10David Caro) [08:14:18] (03CR) 10CI reject: [V: 04-1] tox: use the default python3 for the system [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/826782 (owner: 10David Caro) [08:14:54] 10SRE, 10DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (10Marostegui) I am going to recover this host, as for the troubleshooting we'll focus on db1132. 
[08:16:07] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [08:17:28] 10SRE, 10DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (10Marostegui) p:05High→03Medium Host catching up now, we are not pooling it back anyways for now. [08:18:53] (03CR) 10CI reject: [V: 04-1] wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: 10David Caro) [08:22:01] (03PS1) 10Andrea Denisse: netmon: Rotate logs as the www-data user and librenms group. [puppet] - 10https://gerrit.wikimedia.org/r/827450 (https://phabricator.wikimedia.org/T315393) [08:22:41] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) [08:24:05] (03CR) 10Vgutierrez: [C: 03+2] varnish: Emit X-Varnish-Cluster for misc sites [puppet] - 10https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [08:26:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33579 and previous config saved to /var/cache/conftool/dbconfig/20220829-082643-ladsgroup.json [08:28:10] (03CR) 10FNegri: ceph.bootstrap_and_add: add support to change the osd class type (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [08:31:21] (03CR) 10Andrea Denisse: "Hello, please read the following comment for the rationale on this changes: https://phabricator.wikimedia.org/T315393#8192818" [puppet] - 10https://gerrit.wikimedia.org/r/827450 (https://phabricator.wikimedia.org/T315393) (owner: 
10Andrea Denisse) [08:31:32] !log Failover m2 from db1159 to db1164 - T316202 [08:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:37] T316202: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 [08:32:20] all done [08:32:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33580 and previous config saved to /var/cache/conftool/dbconfig/20220829-083258-ladsgroup.json [08:33:16] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) [08:35:33] 10SRE, 10Gerrit, 10Traffic, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10kostajh) >>! In T191183#6658569, @gerritbot wrote: > Change 456437 **abandoned** by Hashar: > [operations/puppet@production] Gerrit: S... 
[08:35:41] (03CR) 10FNegri: [C: 03+1] wmcs.openstack.quota_increase: allow all known quota types (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: 10David Caro) [08:35:45] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) [08:37:18] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) All done [08:37:27] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) 05Open→03Resolved [08:48:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P33581 and previous config saved to /var/cache/conftool/dbconfig/20220829-084804-ladsgroup.json [08:50:29] (03PS2) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) [08:51:40] (03CR) 10Slyngshede: "That should handle most of Johns comments. There's still some work left to do on type-hinting and documentation." [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [08:52:10] (03CR) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. 
(032 comments) [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [08:55:40] !log test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338 T316337 [08:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:46] T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 [08:55:47] T316337: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 [08:56:06] (03PS3) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) [08:56:35] (03CR) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. (033 comments) [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [08:59:35] (03PS1) 10Jelto: sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) [09:03:07] (03CR) 10CI reject: [V: 04-1] sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:03:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P33582 and previous config saved to /var/cache/conftool/dbconfig/20220829-090310-ladsgroup.json [09:03:23] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [09:03:35] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:07:00] (03PS2) 10Jelto: 
sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) [09:10:10] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [09:10:34] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [09:10:58] (03CR) 10CI reject: [V: 04-1] sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:14:34] hashar: The sync between our repo and github isn't working, could be related to your gerrit restart? [09:16:10] marostegui: the replication failure was only for our own gerrit-replica.wikimedia.org [09:16:25] then I have restarted Gerrit entirely earlier this morning, I guess it is still processing the full replication [09:16:27] checking [09:16:50] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [09:17:01] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] kubernetes: finish profile::docker::storage cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826827 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [09:18:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33583 and previous config saved to /var/cache/conftool/dbconfig/20220829-091816-ladsgroup.json [09:18:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [09:18:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [09:18:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T316186)', diff saved to 
https://phabricator.wikimedia.org/P33584 and previous config saved to /var/cache/conftool/dbconfig/20220829-091840-ladsgroup.json [09:18:45] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:19:07] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:20:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33585 and previous config saved to /var/cache/conftool/dbconfig/20220829-092005-ladsgroup.json [09:20:50] marostegui: it is working for both Github and our replica. For GitHub it is currently at the mediawiki extension CategoryTests [09:20:58] I should probably add a few more threads [09:22:20] (03CR) 10Clément Goubert: [C: 03+2] ml-staging: finish profile::docker::storage cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826860 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [09:22:34] hashar: I was checking our puppet one and doesn't seem to be updated [09:23:49] your update must be in the replication queue [09:23:59] (03PS1) 10Marostegui: dbproxy1013,dbproxy1015: Add db1117 as standby [puppet] - 10https://gerrit.wikimedia.org/r/827458 (https://phabricator.wikimedia.org/T316500) [09:24:05] hashar: Is it possible that it can take hours? [09:25:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T316186)', diff saved to https://phabricator.wikimedia.org/P33586 and previous config saved to /var/cache/conftool/dbconfig/20220829-092511-ladsgroup.json [09:25:42] marostegui: I checked on the last Gerrit restart, the full replication to GitHub takes ~ 12 hours :-\ [09:26:05] oh wow XD [09:26:19] hashar: good, then, at least we know it! thank you for looking into it! 
[09:26:23] I have no idea why [09:26:29] would be something to look into I guess [09:28:53] [2022-08-29 09:28:27] Replication to git@github.com:wikimedia/mediawiki-extensions-CommentStreams [09:28:53] completed in 52862ms, 13363655ms delay, 0 retries [CONTEXT pushOneId="1984f22e" ] [09:29:01] not sure why it takes 52 seconds [09:34:23] (03CR) 10Jcrespo: [C: 03+1] dbproxy1013,dbproxy1015: Add db1117 as standby [puppet] - 10https://gerrit.wikimedia.org/r/827458 (https://phabricator.wikimedia.org/T316500) (owner: 10Marostegui) [09:40:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P33587 and previous config saved to /var/cache/conftool/dbconfig/20220829-094017-ladsgroup.json [09:40:20] (03CR) 10Marostegui: [C: 03+2] dbproxy1013,dbproxy1015: Add db1117 as standby [puppet] - 10https://gerrit.wikimedia.org/r/827458 (https://phabricator.wikimedia.org/T316500) (owner: 10Marostegui) [09:41:55] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (10ayounsi) 05Resolved→03Open According to {T290899} e4 is now fully setup but the Icinga alerts are still active. Either the root cause should be fixed, or the check removed (or downtimed until they're expecte... [09:41:58] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10ayounsi) [09:42:57] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:43:21] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:43:55] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826843 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [09:46:32] (03PS4) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. 
[debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) [09:48:31] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:48:37] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:48:40] (03PS3) 10Clément Goubert: kubestage: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826828 (https://phabricator.wikimedia.org/T316341) [09:48:42] (03PS3) 10Clément Goubert: kubernetes: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826840 (https://phabricator.wikimedia.org/T316341) [09:48:44] (03PS3) 10Clément Goubert: ml-staging: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826833 (https://phabricator.wikimedia.org/T316341) [09:51:22] (03PS3) 10Clément Goubert: dse-k8s: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826845 (https://phabricator.wikimedia.org/T316341) [09:51:40] (03PS3) 10Clément Goubert: releases: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826852 (https://phabricator.wikimedia.org/T316341) [09:52:59] (03CR) 10Clément Goubert: [C: 03+2] dse-k8s: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826845 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [09:55:19] (03PS3) 10Clément Goubert: builder: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826853 (https://phabricator.wikimedia.org/T316341) [09:55:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P33589 and previous config saved to /var/cache/conftool/dbconfig/20220829-095523-ladsgroup.json [09:56:01] 
PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:56:01] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:56:53] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:57:14] ^ all those are expected [09:57:14] (03CR) 10Clément Goubert: [C: 03+2] builder: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826853 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [09:57:17] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:57:25] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:58:11] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:59:31] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:59:39] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:00:11] (03PS1) 10Marostegui: wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/827461 [10:00:25] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:00:31] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:00:33] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy 
[10:00:33] (03PS3) 10Clément Goubert: ml-serve: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826842 (https://phabricator.wikimedia.org/T316341) [10:00:41] (03PS3) 10Clément Goubert: deployment-server: Remove profile::docker::engine::version [puppet] - 10https://gerrit.wikimedia.org/r/826849 (https://phabricator.wikimedia.org/T316341) [10:00:55] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method [10:01:20] (03PS4) 10Clément Goubert: R:profile::docker::engine::version removal and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341) [10:01:23] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:03:05] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [10:05:58] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [10:07:03] (03PS1) 10Marostegui: mariadb: Move db1159 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/827462 (https://phabricator.wikimedia.org/T316500) [10:08:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1159 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/827462 (https://phabricator.wikimedia.org/T316500) (owner: 10Marostegui) [10:08:35] (03PS1) 10Kosta Harlan: Fix WelcomeSurvey CentralAuthPostLoginRedirect hook [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827191 (https://phabricator.wikimedia.org/T315583) [10:09:09] !log test trafficserver: Hide non session cookies during cache lookup in drmrs - T316338 T316337 [10:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:15] T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 [10:09:16] T316337: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 [10:10:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T316186)', diff saved to https://phabricator.wikimedia.org/P33590 and previous config saved to /var/cache/conftool/dbconfig/20220829-101029-ladsgroup.json [10:10:30] (03Abandoned) 10Kosta Harlan: Fix WelcomeSurvey CentralAuthPostLoginRedirect hook [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827191 (https://phabricator.wikimedia.org/T315583) (owner: 10Kosta Harlan) 
[10:13:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33591 and previous config saved to /var/cache/conftool/dbconfig/20220829-101345-ladsgroup.json [10:16:53] (03CR) 10JMeybohm: Add a helmfile configuration for the dse-k8s-eqiad cluster (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [10:18:13] (03CR) 10Kosta Harlan: [C: 04-1] Declare mediawiki.createaccount_blocked_user schema (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [10:25:35] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:25:39] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:28:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P33592 and previous config saved to /var/cache/conftool/dbconfig/20220829-102851-ladsgroup.json [10:29:54] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Jeff_G) How can file revisions go missin... [10:33:56] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (10Jelto) @gmodena @Tchanders Can you clarify if `analytics-privatedata-users` is the correct group here? Is this a similar Jupyte... 
[10:40:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:41:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:41:42] mmm [10:42:58] I see the runbook is non-existent [10:43:03] quick on the draw Manuel. [10:43:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P33593 and previous config saved to /var/cache/conftool/dbconfig/20220829-104358-ladsgroup.json [10:44:55] TLS exchange is fine to videoscaler.svc.eqiad.wmnet. 
[10:45:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:46:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:49:34] LVS logs show this: https://phabricator.wikimedia.org/P33594 [10:50:41] PROBLEM - Check systemd state on cp6002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_varnish-frontend-hospital.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:22] (03PS3) 10Hnowlan: restbase: add restbase103[123] [puppet] - 10https://gerrit.wikimedia.org/r/803520 [10:56:59] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33595 and previous config saved to /var/cache/conftool/dbconfig/20220829-105904-ladsgroup.json [10:59:07] (03CR) 10Hnowlan: [C: 03+2] restbase: add restbase103[123] [puppet] - 10https://gerrit.wikimedia.org/r/803520 (owner: 10Hnowlan) [10:59:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:59:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:59:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 
(T316186)', diff saved to https://phabricator.wikimedia.org/P33596 and previous config saved to /var/cache/conftool/dbconfig/20220829-105928-ladsgroup.json [11:00:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:02:50] mw1437 has maxed CPU since about 02:00 [11:02:58] ffmpeg responsible [11:03:13] which is causing the jobrunner and videoscaler timeouts [11:05:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T316186)', diff saved to https://phabricator.wikimedia.org/P33597 and previous config saved to /var/cache/conftool/dbconfig/20220829-110548-ladsgroup.json [11:20:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P33598 and previous config saved to /var/cache/conftool/dbconfig/20220829-112054-ladsgroup.json [11:21:44] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1031.eqiad.wmnet with OS buster [11:23:27] (03CR) 10Marostegui: "I am not sure we should stop replication, some changes might fail with replication enabled (those on massive tables mostly). Why do you wa" [software] - 10https://gerrit.wikimedia.org/r/826522 (owner: 10Ladsgroup) [11:28:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (10Siko_WMDE) Hi, I have access to analytics-privatedata-users now. Thank you and all the best, Simon [11:29:15] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". 
" - https://phabricator.wikimedia.org/T244567 (10Aklapper) @Jeff_G: Because software code... [11:33:46] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1031.eqiad.wmnet with reason: host reimage [11:36:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P33599 and previous config saved to /var/cache/conftool/dbconfig/20220829-113600-ladsgroup.json [11:37:31] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1031.eqiad.wmnet with reason: host reimage [11:38:36] sobanski: Now that I think about it...should we create an actionable somewhere or something to create and fill out the (non) documentation at https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 [11:38:37] ? [11:39:21] I was thinking about that but didn't want to interrupt troubleshooting. I can create a task. [11:39:52] thanks! [11:48:12] sobanski: thanks, there is some info here about the jobrunners: https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Jobrunners [11:48:49] This case seems to match that scenario, I'm unsure if it's worth removing some hosts from the videoscaler cluster as it suggests in this case or not [11:49:54] 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for junuche, demon, jhuneidi - https://phabricator.wikimedia.org/T315887 (10Jelto) 05Open→03Resolved a:03Jelto Users `jnuche`, `demon` and `jhuneidi` were added to the ldap group `gerritadmin`. Please note th... 
[11:51:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T316186)', diff saved to https://phabricator.wikimedia.org/P33600 and previous config saved to /var/cache/conftool/dbconfig/20220829-115107-ladsgroup.json [11:51:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:51:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:55:59] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 20+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) [12:00:10] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:10] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10WDoranWMF) [12:01:32] 10SRE, 10Image-Suggestions, 10serviceops: Setup Initial Image Suggestion Service CI and k8s params/stubs - https://phabricator.wikimedia.org/T305154 (10WDoranWMF) 05Open→03Resolved a:03WDoranWMF [12:01:39] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10hnowlan) [12:02:00] 10SRE, 10Image-Suggestions, 10serviceops, 10Patch-For-Review: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10hnowlan) 05Open→03Resolved [12:02:58] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on restbase1031.eqiad.wmnet with reason: New host [12:03:00] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on restbase1031.eqiad.wmnet with reason: New host [12:03:22] 
PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:44] !log joining restbase1031-a to cassandra cluster [12:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:30] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:08:10] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host restbase1031.eqiad.wmnet with OS buster [12:14:20] !log rolling restart of ats-be fleet wide to apply "Hide non session cookies during cache lookup" - T316338 T316337 [12:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:26] T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 [12:14:26] T316337: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 [12:23:44] (03CR) 10Krinkle: [C: 03+1] doc: properly redirect back compat URLs [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [12:29:42] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 20+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) [12:30:38] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) [12:34:22] 10SRE, 10SRE-Access-Requests: Requesting access to deploy-phabricator, gerrit-deployers, gerrit-root, phabricator-roots for jhuneidi - https://phabricator.wikimedia.org/T316521 (10thcipriani) [12:37:17] 10SRE, 10SRE-Access-Requests: Requesting access to deploy-phabricator for dancy - https://phabricator.wikimedia.org/T316524 (10thcipriani) [12:39:28] 10SRE, 
10SRE-Access-Requests: Requesting access to deploy-phabricator, gerrit-root, phabricator-roots for jhuneidi - https://phabricator.wikimedia.org/T316521 (10thcipriani) [12:39:33] (03PS1) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827490 (https://phabricator.wikimedia.org/T315353) [12:39:53] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10ayounsi) [12:40:01] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ayounsi) [12:40:21] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (10ayounsi) 05Resolved→03Open Re-opening as I noticed that alerting was still disabled for that device and the power briefly goes above threshold. See https://librenms.wikimedia.org/device/d... 
[12:40:22] (03PS1) 10Bartosz Dziewoński: persistRevisionThreadItems: Allow processing current revisions only [extensions/DiscussionTools] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827196 (https://phabricator.wikimedia.org/T315510) [12:43:02] 10SRE, 10SRE-Access-Requests: Requesting access to deploy-phabricator, gerrit-root for dduvall - https://phabricator.wikimedia.org/T316526 (10thcipriani) [12:43:44] (03PS1) 10Marostegui: x1: Change binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/827491 [12:44:19] (03PS2) 10Marostegui: x1: Change binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/827491 [12:45:02] 10SRE, 10SRE-Access-Requests: Requesting access to deploy-phabricator for hashar - https://phabricator.wikimedia.org/T316527 (10thcipriani) [12:46:24] 10SRE, 10SRE-Access-Requests: Requesting access to deploy-phabricator for jnuche - https://phabricator.wikimedia.org/T316528 (10thcipriani) [12:49:00] RECOVERY - Check systemd state on dse-k8s-worker1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:18] (03CR) 10DCausse: [C: 03+1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [12:52:38] (03CR) 10Jforrester: [C: 03+1] Run initSiteStats twice a month (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: 10Chad) [12:54:25] (03PS1) 10Thcipriani: RelEng Access Requests [puppet] - 10https://gerrit.wikimedia.org/r/827494 (https://phabricator.wikimedia.org/T316528) [12:56:04] PROBLEM - Check systemd state on dse-k8s-worker1008 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:42] 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to gerritadmin for junuche, demon, 
jhuneidi - https://phabricator.wikimedia.org/T315887 (10thcipriani) >>! In T315887#8193397, @Jelto wrote: > Users `jnuche`, `demon` and `jhuneidi` were added to the ldap group `gerritadmin`. Ple... [13:00:03] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220829T1300). [13:00:04] _joe_, zabe, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] o/ [13:00:21] hi. i have a bunch of unusual stuff, i hope that's okay [13:00:28] <_joe_> o/ [13:01:09] <_joe_> I'll just deploy my patch quickly [13:01:23] <_joe_> it's a silly one-liner anyways [13:01:26] o/ [13:01:35] hey [13:01:44] urbanecm: do you want to deploy or should I? [13:01:54] taavi: go ahead (once _joe_ is done) [13:01:55] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Move 1% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823676 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [13:02:00] ok, sure [13:02:10] (03CR) 10CI reject: [V: 04-1] Move 1% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823676 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [13:02:29] (I'm around if needed, just ping me) [13:03:21] <_joe_> uhhh [13:03:33] (03PS5) 10Giuseppe Lavagetto: Move 1% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823676 (https://phabricator.wikimedia.org/T271736) [13:03:40] you probably just need to rebase against master? 
[13:03:42] <_joe_> just a rebase [13:03:43] <_joe_> yeah [13:05:03] (03PS5) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) [13:05:32] (03PS2) 10Ori: Increase roll-out of query-sorting to 75% [puppet] - 10https://gerrit.wikimedia.org/r/826996 (https://phabricator.wikimedia.org/T314868) [13:05:46] (03CR) 10CI reject: [V: 04-1] Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [13:06:09] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [13:07:04] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37022/console" [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [13:07:43] <_joe_> uhm maybe I need to manually merge? [13:07:48] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:07:59] you need to re-+2 it due to the rebase [13:08:13] <_joe_> zabe: or submit it :) [13:08:24] <_joe_> anyways, done, will be done in a couple minutes [13:08:37] for some reason, commits in mediawiki-config usually need a manual rebase (gerrit's button's enough). not sure _why_, but that's just my experience. 
[13:08:42] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37023/console" [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [13:09:00] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37024/console" [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [13:09:21] urbanecm: afaik Gerrit has been manually configured to not auto rebase there, to encourage people to properly look at the diffConfig output etc after a rebase [13:09:31] might be [13:10:16] taavi: friendly reminder: you might want to +2 zabe's backports now, to save a bit of CI's time :) [13:10:40] that's a great idea, I'll do that [13:11:23] (03CR) 10Majavah: "backporting!" [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826615 (owner: 10Zabe) [13:11:31] (03CR) 10Majavah: [C: 03+2] phan: Fix use of IMaintainableDatabase::tableExists [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826615 (owner: 10Zabe) [13:11:32] urbanecm: that is a configuration setting in Gerrit. The repo mediawiki-config is set to not attempt to resolve a conflict when a file proposed in the patch has been touched in the target branch [13:11:51] hashar: so basically what taavi said? 
[13:11:54] (03CR) 10Majavah: [C: 03+2] Use real transactions when managing the voter list for a poll [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826614 (https://phabricator.wikimedia.org/T316150) (owner: 10Zabe) [13:12:10] urbanecm: so if a change modifies InitialiseSettings.php and the branch had an update touching that file, Gerrit marks it as being in conflict which requires a rebase [13:12:14] gotcha [13:12:21] ah yes what taavi said :] [13:12:22] (03CR) 10Vgutierrez: [C: 03+2] Increase roll-out of query-sorting to 75% [puppet] - 10https://gerrit.wikimedia.org/r/826996 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [13:12:23] sorry [13:12:33] that matches my experience, but i wasn't sure whether that's a bug or intentional setting. thanks for clarifying. [13:12:48] !log Increase roll-out of query-sorting to 75% - T314868 [13:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:53] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [13:13:04] (03PS1) 10Clément Goubert: parsoid: Install parse1* servers with only php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/827498 (https://phabricator.wikimedia.org/T312638) [13:13:05] in the Gerrit UI it should shows up with a [Merge conflict] status indicating one has to carefully consider whether another change might have a conflict [13:13:21] or maybe we can let Gerrit auto merge for us assuming we have a good test coverage it is probably fine [13:14:01] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37026/console" [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [13:14:06] (03Merged) 10jenkins-bot: phan: Fix use of IMaintainableDatabase::tableExists [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826615 (owner: 10Zabe) [13:14:19] 
(03CR) 10Majavah: [C: 03+2] "backport" [extensions/DiscussionTools] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827196 (https://phabricator.wikimedia.org/T315510) (owner: 10Bartosz Dziewoński) [13:14:21] (03Merged) 10jenkins-bot: Use real transactions when managing the voter list for a poll [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826614 (https://phabricator.wikimedia.org/T316150) (owner: 10Zabe) [13:14:28] hashar: personally, i always hit rebase when backporting (and i advise fellow deployers the same). i also saw quite few deployers confused by it, so...i vote for allowing autorebase there [13:14:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:14:51] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Moving 1% of users to php 7.4 (duration: 04m 18s) [13:15:47] <_joe_> I'm done [13:15:51] thanks [13:15:57] zabe: your patches are up next [13:16:02] do you have any way of testing them? [13:16:06] no [13:17:15] (03CR) 10Hashar: gerrit: allow nist kex algorithms on OpenSsh server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [13:17:18] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator for jnuche - https://phabricator.wikimedia.org/T316528 (10jnuche) Thanks @thcipriani ! 
[13:17:20] ok, I'll just sync in that case [13:17:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:17:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:17:54] marostegui: looks like Gerrit has completed the replications to GitHub [13:18:19] (03PS1) 10Clément Goubert: deployment-prep: Switch over to bullseye docker nodes [puppet] - 10https://gerrit.wikimedia.org/r/827499 (https://phabricator.wikimedia.org/T316341) [13:18:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:21:27] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/SecurePoll/: T316150 (duration: 03m 44s) [13:21:32] T316150: Adding a global account to override list gives DBTransactionSizeError - https://phabricator.wikimedia.org/T316150 [13:21:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] deployment-prep: Switch over to bullseye docker nodes [puppet] - 10https://gerrit.wikimedia.org/r/827499 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [13:21:37] zabe: done [13:21:42] MatmaRex: yours are up next [13:21:45] thanks! [13:21:54] (03Merged) 10jenkins-bot: persistRevisionThreadItems: Allow processing current revisions only [extensions/DiscussionTools] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827196 (https://phabricator.wikimedia.org/T315510) (owner: 10Bartosz Dziewoński) [13:21:55] does the order matter here? [13:22:30] taavi: thanks, which exactly? [13:22:45] taavi: createExtensionTables.php needs to be before the config patch [13:22:54] and the backport match needs to be before persistRevisionThreadItems.php [13:22:57] (03CR) 10Marostegui: [C: 03+2] x1: Change binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/827491 (owner: 10Marostegui) [13:23:06] hashar: indeed!! 
[13:23:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:23:26] ok [13:23:39] !log taavi@mwmaint1002 ~ $ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki testwiki discussiontools [13:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:04] (03PS2) 10Majavah: Enable wgDiscussionToolsEnablePermalinksBackend on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827490 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [13:24:13] (03CR) 10Majavah: [C: 03+2] Enable wgDiscussionToolsEnablePermalinksBackend on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827490 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [13:24:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:24:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:35] the other maint script might take longer, probably a few minutes, maybe up to a few hours [13:24:44] if you can, please record how long it takes and the output :) [13:24:53] (03CR) 10Clément Goubert: [C: 03+2] deployment-prep: Switch over to bullseye docker nodes [puppet] - 10https://gerrit.wikimedia.org/r/827499 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [13:24:58] hmm, ok [13:25:06] (03Merged) 10jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827490 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [13:25:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:43] MatmaRex: the backport is available for testing on mwdebug1001 [13:26:00] syncing the maintenance script backport in the meantime [13:26:09] taavi: thanks, it's just a change to the maintenace script so i can't test anything [13:26:18] sorry, the 
config change I mean [13:26:49] (03PS3) 10KartikMistry: Enable SectionTranslation on 10 more WPs where ContentTranslation is default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827174 (https://phabricator.wikimedia.org/T313300) [13:26:53] (03PS1) 10KartikMistry: testwiki: Fix language code for Bhojpuri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827500 (https://phabricator.wikimedia.org/T313296) [13:27:12] oh, looking [13:27:59] seems good [13:28:05] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) [13:28:15] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10ayounsi) [13:29:00] ok, will sync that once the maintenance script has been synced [13:29:29] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php: Backport: [[gerrit:827196|persistRevisionThreadItems: Allow processing current revisions only (T315510)]] (duration: 03m 40s) [13:29:34] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [13:30:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:31:08] (03CR) 10Ladsgroup: [C: 03+2] wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/827461 (owner: 10Marostegui) [13:31:13] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/827461 (owner: 10Marostegui) [13:31:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:31:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:31:24] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/827461 (owner: 10Marostegui) [13:31:38] !log Failover m5 master [13:31:42] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:33:24] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:827490|Enable wgDiscussionToolsEnablePermalinksBackend on testwiki (T315353)]] (duration: 03m 48s) [13:33:28] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [13:34:21] hmmm [13:34:23] taavi@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki=testwiki --current --all | tee persistRevisionThreadItems.log [13:34:23] Bad MediaWiki install path: /srv/mediawiki/php-1.39.0-wmf.26 [13:34:41] urbanecm: ^ I haven't seen anything like this before [13:35:08] Let me see [13:35:35] hm [13:35:40] I'm in a screen, if that helps at all [13:35:45] Ack [13:35:48] is that an error from the script itself? [13:35:53] I don't know [13:35:53] yeah [13:36:00] reproducible with `[urbanecm@mwmaint1002 ~]$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki=testwiki --help` [13:36:00] it's due to the dots in the filepath [13:36:10] which dots? [13:36:11] dots? [13:36:11] https://gerrit.wikimedia.org/g/mediawiki/extensions/DiscussionTools/+/6f5bcf2c09b46b83efef799fb7c70cc471802618/maintenance/persistRevisionThreadItems.php#24 [13:36:15] it is from the script [13:36:16] php-1.39.0-wmf.26 [13:36:19] 1 dot 39 [13:36:21] ahhh [13:36:29] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37029/console" [puppet] - 10https://gerrit.wikimedia.org/r/827498 (https://phabricator.wikimedia.org/T312638) (owner: 10Clément Goubert) [13:36:32] why does the script do that? 
[13:36:41] good question [13:36:44] security team insisted that i add it even after i explained it's not necessary [13:36:55] it...makes it impossible to run the script [13:37:04] i supposed i can delete that line and we can backport it [13:37:18] i'd be fine with that, but i'm really curious to hear more about the recommendation [13:37:44] perhaps sbassett can clarify it? [13:37:49] one sec [13:38:01] I don't think that requirement makes any sense, but if secteam wanted that to be there I don't think I'm willing to backport a change removing it without further clarification [13:38:05] (03PS3) 10Clément Goubert: parsoid: Install parse1* servers with only php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/827498 (https://phabricator.wikimedia.org/T312638) [13:38:21] there's a long discussion about it on https://phabricator.wikimedia.org/T242134 [13:38:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10akosiaris) [13:38:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10akosiaris) >>! In T313873#8189120, @Jclark-ctr wrote: > @akosiaris Can you verify host names? kubernetes102[01] Already in use Racking task T290202 Indeed. My mistake. Upd... 
[13:39:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [13:40:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [13:40:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T316186)', diff saved to https://phabricator.wikimedia.org/P33601 and previous config saved to /var/cache/conftool/dbconfig/20220829-134014-ladsgroup.json [13:40:58] (03CR) 10Giuseppe Lavagetto: [C: 03+1] parsoid: Install parse1* servers with only php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/827498 (https://phabricator.wikimedia.org/T312638) (owner: 10Clément Goubert) [13:41:36] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (10ayounsi) [13:41:58] well, anyway, it's https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/827501 if anyone is up for reviewing it [13:43:04] (03CR) 10Clément Goubert: [C: 03+2] parsoid: Install parse1* servers with only php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/827498 (https://phabricator.wikimedia.org/T312638) (owner: 10Clément Goubert) [13:43:33] (03PS6) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) [13:43:47] (03CR) 10Herron: [C: 03+1] netmon: Rotate logs as the www-data user and librenms group. 
[puppet] - 10https://gerrit.wikimedia.org/r/827450 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [13:45:17] 10SRE, 10Infrastructure-Foundations, 10netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (10ayounsi) [13:45:23] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (10ayounsi) [13:46:11] MatmaRex: yeah, i agree with taavi. the requirement doesn't make any sense to me, but before making a decision, I'd like to hear secteam's clarification first. is it possible to skip running the script today? [13:46:29] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) p:05Triage→03Low [13:46:40] yeah [13:46:51] let's do that then [13:46:59] does anyone have anything else to deploy? [13:47:33] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10ayounsi) [13:47:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T316186)', diff saved to https://phabricator.wikimedia.org/P33602 and previous config saved to /var/cache/conftool/dbconfig/20220829-134736-ladsgroup.json [13:47:41] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (10ayounsi) [13:47:51] I could deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/820398, no-op config cleanup [13:47:54] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10taavi) [13:47:54] if there’s nothing else to do [13:48:10] sure, do you want to self-deploy or want 
me to deploy it? [13:48:18] I can self-deploy :) [13:48:24] go ahead [13:48:29] thankx [13:48:31] *thanks [13:49:23] (03PS2) 10Lucas Werkmeister (WMDE): Remove unused SearchSettingsForSDC.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820398 [13:49:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "sync Wikibase.php first, then SDC.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820398 (owner: 10Lucas Werkmeister (WMDE)) [13:49:40] MatmaRex: is there a task "script cannot be executed" to discuss this? or should i/we fill one? [13:50:27] there isn't [13:50:32] feel free to start one [13:50:37] okay, I'll create one [13:50:54] (or i can do it later, i'm awya for a bit now) [13:50:54] (03Merged) 10jenkins-bot: Remove unused SearchSettingsForSDC.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820398 (owner: 10Lucas Werkmeister (WMDE)) [13:50:59] thanks [13:51:27] quickly testing my change on mwdebug1001 [13:52:45] syncing [13:54:51] I’m not sure I’ve ever deleted a file through scap before, but I assume I just sync-file the deleted path and rsync will do the right thing [13:55:43] scap pull seems to have deleted it on mwdebug1001, at least [13:56:12] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikibase.php: Config: [[gerrit:820398|Remove unused SearchSettingsForSDC.php]] (1/2, no-op) (duration: 03m 32s) [13:56:50] grmbl [13:56:52] FileNotFoundError [13:57:35] so… sync-file all of wmf-config/ instead? 
[13:57:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:58:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:58:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:59:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:00:06] PROBLEM - Check systemd state on parse1004 is CRITICAL: CRITICAL - degraded: The following units failed: mcrouter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:06] yup, looks like syncing the dir is what was done in https://sal.toolforge.org/log/PHhzhoEBa_6PSCT9Z2eR [14:00:14] PROBLEM - Check systemd state on parse1017 is CRITICAL: CRITICAL - degraded: The following units failed: mcrouter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:21] so I’ll sync all of wmf-config/ unless someone yells at me not to very soon [14:00:35] (and hope nobody left unsynced changes on the deployment host…) [14:01:52] PROBLEM - Check systemd state on parse1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:02] syncing [14:02:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P33603 and previous config saved to /var/cache/conftool/dbconfig/20220829-140243-ladsgroup.json [14:02:52] and it’s gone from mwdebug1002 after the “sync-testservers” step, so far so good [14:03:18] PROBLEM - Check systemd state on parse1024 is CRITICAL: CRITICAL - degraded: The following units failed: mcrouter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:29] (03PS1) 10Vgutierrez: trafficserver: Set action=never-cache for caching=websockets [puppet] - 10https://gerrit.wikimedia.org/r/827506 
(https://phabricator.wikimedia.org/T316545) [14:04:14] RECOVERY - Check systemd state on parse1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:22] RECOVERY - Check systemd state on parse1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:41] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37030/console" [puppet] - 10https://gerrit.wikimedia.org/r/827506 (https://phabricator.wikimedia.org/T316545) (owner: 10Vgutierrez) [14:04:44] taavi: MatmaRex: phabricatorized as T316548. please feel free to add your thoughts there. [14:04:44] T316548: DiscussionTools' maintenance script cannot be executed in Wikimedia production - https://phabricator.wikimedia.org/T316548 [14:04:56] RECOVERY - Check systemd state on parse1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:28] RECOVERY - Check systemd state on parse1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:39] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/: Config: [[gerrit:820398|Remove unused SearchSettingsForSDC.php]] (2/2, no-op; syncing deleted file requires syncing entire directory AFAICT) (duration: 03m 37s) [14:05:52] anything else to deploy? [14:06:30] PROBLEM - PHP7 rendering on parse1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 458 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:06:31] !log UTC afternoon backport+config window done [14:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:32] PROBLEM - mediawiki-installation DSH group on parse1004 is CRITICAL: Host parse1004 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:09:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator for jnuche - https://phabricator.wikimedia.org/T316528 (10Jelto) p:05Triage→03Medium a:03Jelto Thanks for opening the request! Please note the check boxes should be marked by SRE on Clinic Duty. I double checked... [14:09:59] (03PS2) 10Lucas Werkmeister (WMDE): Only set WikibaseCirrusSearch settings if wmg globals are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806872 [14:10:01] (03PS2) 10Lucas Werkmeister (WMDE): Directly set WikibaseCirrusSearch settings in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806873 [14:10:03] (03PS2) 10Lucas Werkmeister (WMDE): Remove unused assignments from SearchSettingsForWikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806874 [14:10:06] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) [14:12:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator for hashar - https://phabricator.wikimedia.org/T316527 (10Jelto) p:05Triage→03Medium a:03Jelto Thanks for opening the request! Please note the check boxes should be marked by SRE on Clinic Duty. I double checked... [14:14:28] PROBLEM - PHP7 rendering on parse1008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 458 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:14:28] PROBLEM - PHP7 rendering on parse1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 458 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:14:42] (03CR) 10David Caro: wmcs.openstack.quota_increase: allow all known quota types (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: 10David Caro) [14:15:29] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix fresh install [puppet] - 10https://gerrit.wikimedia.org/r/827507 [14:15:48] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01017 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:17:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator for dancy - https://phabricator.wikimedia.org/T316524 (10Jelto) p:05Triage→03Medium a:03Jelto Thanks for opening the request! Please note the check boxes should be marked by SRE on Clinic Duty. I double checked a... 
[14:17:34] PROBLEM - mediawiki-installation DSH group on parse1001 is CRITICAL: Host parse1001 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:17:34] PROBLEM - mediawiki-installation DSH group on parse1008 is CRITICAL: Host parse1008 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:17:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P33604 and previous config saved to /var/cache/conftool/dbconfig/20220829-141749-ladsgroup.json [14:19:48] PROBLEM - PHP7 rendering on parse1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 458 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:20:09] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/827507 (owner: 10Giuseppe Lavagetto) [14:20:14] PROBLEM - memcached socket on parse1001 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory https://wikitech.wikimedia.org/wiki/Memcached [14:20:16] PROBLEM - mediawiki-installation DSH group on parse1017 is CRITICAL: Host parse1017 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:21:50] PROBLEM - memcached socket on parse1007 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory https://wikitech.wikimedia.org/wiki/Memcached [14:22:26] PROBLEM - Check systemd state on parse1010 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:56] PROBLEM - mediawiki-installation DSH group on parse1024 is CRITICAL: Host parse1024 is not in mediawiki-installation 
dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:22:56] PROBLEM - memcached socket on parse1017 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory https://wikitech.wikimedia.org/wiki/Memcached [14:23:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator, gerrit-root, phabricator-roots for jhuneidi - https://phabricator.wikimedia.org/T316521 (10Jelto) p:05Triage→03Medium a:03Jelto Thanks for opening the request! Please note the check boxes should be marked by SRE... [14:23:30] PROBLEM - Apache HTTP on parse1007 is CRITICAL: connect to address 10.64.16.48 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [14:24:12] RECOVERY - memcached socket on parse1001 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [14:24:42] (03PS1) 10David Caro: pylint: add timeouts to requests.* calls [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 [14:25:42] (03CR) 10David Caro: "Note that the timeouts are just high enough (imo), but feel free to let me know if they should have other values." [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 (owner: 10David Caro) [14:26:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deploy-phabricator, gerrit-root for dduvall - https://phabricator.wikimedia.org/T316526 (10Jelto) p:05Triage→03Medium a:03Jelto Thanks for opening the request! Please note the check boxes should be marked by SRE on Clinic Duty. I... 
[14:26:47] (03PS2) 10David Caro: pylint: add timeouts to requests.* calls [cookbooks] - 10https://gerrit.wikimedia.org/r/827508 [14:26:49] parse errors are me, sorry [14:26:59] not in production, they're new hosts [14:28:13] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1032.eqiad.wmnet with OS buster [14:29:18] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [14:29:34] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01017 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:30:38] 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops, 10Community-Tech (CommTech-Sprint-32): SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10JMcLeod_WMF) [14:31:58] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10SRE Observability: SCS CPU monitoring issue - https://phabricator.wikimedia.org/T285229 (10ayounsi) [14:32:09] 10SRE, 10Structured-Data-Backlog, 10Wikimedia-Mailing-lists: Create sd-alerts@lists.wikimedia.org mailing list - https://phabricator.wikimedia.org/T316543 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Created and added you as owner: https://lists.wikimedia.org/postorius/lists/sd-alerts.lists.wikimedia.... 
[14:32:14] (03CR) 10Jelto: [C: 03+2] "looks good to me, I'll proceed here soon" [puppet] - 10https://gerrit.wikimedia.org/r/827494 (https://phabricator.wikimedia.org/T316528) (owner: 10Thcipriani) [14:32:17] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Upgrade etherpad.wikimedia.org to (more) recent Etherpad version with more rich end-user features - https://phabricator.wikimedia.org/T316421 (10akosiaris) p:05Triage→03Low I am not sure I see what are the extra features either. Changelog (@JeanFred is correct r... [14:32:44] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [14:32:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T316186)', diff saved to https://phabricator.wikimedia.org/P33605 and previous config saved to /var/cache/conftool/dbconfig/20220829-143255-ladsgroup.json [14:33:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:33:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:33:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T316186)', diff saved to https://phabricator.wikimedia.org/P33606 and previous config saved to /var/cache/conftool/dbconfig/20220829-143319-ladsgroup.json [14:36:07] (03PS1) 10Alexandros Kosiaris: etherpad: Add a link to CoC in the defaultPadText [puppet] - 10https://gerrit.wikimedia.org/r/827512 (https://phabricator.wikimedia.org/T136744) [14:38:28] (03PS1) 10Clément Goubert: parsoid: Add parse1* servers to dsh [puppet] - 
10https://gerrit.wikimedia.org/r/827513 (https://phabricator.wikimedia.org/T312638) [14:39:43] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Routing loop for unused WMCS IPs in 185.15.56.0/24 - https://phabricator.wikimedia.org/T315956 (10cmooney) a:03cmooney Yeah I hadn't considered that when I made the change originally. Figured it was a simplification but the routing loop is... [14:40:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T316186)', diff saved to https://phabricator.wikimedia.org/P33607 and previous config saved to /var/cache/conftool/dbconfig/20220829-144030-ladsgroup.json [14:40:42] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1032.eqiad.wmnet with reason: host reimage [14:41:02] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on restbase1031.eqiad.wmnet with reason: New host [14:41:05] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on restbase1031.eqiad.wmnet with reason: New host [14:42:52] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005817 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:43:27] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1032.eqiad.wmnet with reason: host reimage [14:46:47] Widespread puppet failures was also me, puppet didn't cleanly pass on first run on new parse hosts [14:46:57] sorry 'bout the noise [14:47:52] 10SRE, 10Structured-Data-Backlog, 10Wikimedia-Mailing-lists: Create sd-alerts@lists.wikimedia.org mailing list - https://phabricator.wikimedia.org/T316543 (10CBogen) >>! In T316543#8194314, @Ladsgroup wrote: > Created and added you as owner: https://lists.wikimedia.org/postorius/lists/sd-alerts.lists.wikimed... 
[14:48:12] (03CR) 10Aqu: "@aotta @btullis This is ready to merge, after pushing the last version of conda-analytics (0.0.8) on apt.wm.org" [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [14:49:46] (03PS2) 10Clément Goubert: parsoid: Add parse1* servers to dsh [puppet] - 10https://gerrit.wikimedia.org/r/827513 (https://phabricator.wikimedia.org/T312638) [14:52:09] RECOVERY - PHP7 rendering on parse1017 is OK: HTTP OK: HTTP/1.1 302 Found - 519 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:53:45] RECOVERY - PHP7 rendering on parse1004 is OK: HTTP OK: HTTP/1.1 302 Found - 520 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:54:13] RECOVERY - PHP7 rendering on parse1024 is OK: HTTP OK: HTTP/1.1 302 Found - 520 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:54:59] RECOVERY - PHP7 rendering on parse1008 is OK: HTTP OK: HTTP/1.1 302 Found - 520 bytes in 0.540 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:55:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P33608 and previous config saved to /var/cache/conftool/dbconfig/20220829-145536-ladsgroup.json [14:58:25] (03CR) 10Dzahn: [C: 03+2] "oh, already upgraded! 
cool:)" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [14:59:13] RECOVERY - Apache HTTP on parse1007 is OK: HTTP OK: HTTP/1.1 302 Found - 505 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:01:29] RECOVERY - Check systemd state on parse1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:06] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (10gmodena) Hey @Jelto - it's a notebook like the one described in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#... [15:05:02] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Upgrade etherpad.wikimedia.org to (more) recent Etherpad version with more rich end-user features - https://phabricator.wikimedia.org/T316421 (10JeanFred) Sounds to me that this task should be split up: * renaming this one to “Minor upgrade of Etherpad from 1.8.16 t... 
[15:09:50] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10Papaul) 05Open→03Resolved @Marostegui disk replaced ` Solid State Disk 0:1:3 Online 3 1787.88 GB Not Capable SATA SSD No 100% [15:10:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P33609 and previous config saved to /var/cache/conftool/dbconfig/20220829-151042-ladsgroup.json [15:13:46] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1032.eqiad.wmnet with OS buster [15:14:50] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1033.eqiad.wmnet with OS buster [15:16:33] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10bking) Per the output of `swiftly head rdf-streaming-updater-codfw`, the `rdf-streaming-updater-codfw` swif... [15:18:12] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech: Migrate WDQS to Java 11 - https://phabricator.wikimedia.org/T316103 (10Gehel) p:05Triage→03High [15:18:16] 10SRE, 10ops-codfw, 10DBA: db2149 is sad after reboot - https://phabricator.wikimedia.org/T316494 (10Papaul) @Marostegui ` Solid State Disk 0:1:5 Failed 5 1787.88 GB Not Capable SATA SSD No 100% [15:18:24] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10bking) [15:19:07] 10SRE, 10ops-codfw, 10DBA: db2149 is sad after reboot - https://phabricator.wikimedia.org/T316494 (10Papaul) Since the server is under warranty I will ask for a replacement. [15:19:17] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. 
- https://phabricator.wikimedia.org/T316031 (10bking) 05Open→03Resolved p:05Triage→03Medium [15:19:52] (03PS2) 10Vgutierrez: trafficserver: Set action=never-cache for caching=websockets|pipe [puppet] - 10https://gerrit.wikimedia.org/r/827506 (https://phabricator.wikimedia.org/T316545) [15:20:56] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37031/console" [puppet] - 10https://gerrit.wikimedia.org/r/827506 (https://phabricator.wikimedia.org/T316545) (owner: 10Vgutierrez) [15:21:19] 10SRE, 10Traffic, 10Patch-For-Review: ATS isn't honoring the cache policy set in cache::alternate_domains on some cases - https://phabricator.wikimedia.org/T316545 (10Vgutierrez) [15:21:35] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10Gehel) [15:21:45] 10SRE, 10Traffic, 10Patch-For-Review: ATS isn't honoring the cache policy set in cache::alternate_domains on some cases - https://phabricator.wikimedia.org/T316545 (10Vgutierrez) p:05Triage→03Medium a:03Vgutierrez [15:22:40] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. 
- https://phabricator.wikimedia.org/T316031 (10Gehel) 05Resolved→03Open [15:25:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T316186)', diff saved to https://phabricator.wikimedia.org/P33610 and previous config saved to /var/cache/conftool/dbconfig/20220829-152549-ladsgroup.json [15:25:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [15:26:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [15:26:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T316186)', diff saved to https://phabricator.wikimedia.org/P33611 and previous config saved to /var/cache/conftool/dbconfig/20220829-152612-ladsgroup.json [15:27:18] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1033.eqiad.wmnet with reason: host reimage [15:27:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33612 and previous config saved to /var/cache/conftool/dbconfig/20220829-152741-ladsgroup.json [15:30:05] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220829T1530). 
[15:31:56] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1033.eqiad.wmnet with reason: host reimage [15:34:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T316186)', diff saved to https://phabricator.wikimedia.org/P33613 and previous config saved to /var/cache/conftool/dbconfig/20220829-153440-ladsgroup.json [15:38:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: thanos-be2002 sdj failed - https://phabricator.wikimedia.org/T314913 (10Papaul) 05Open→03Resolved Disk replaced [15:44:35] (03CR) 10Jforrester: TranslatableBundleLogFormatter: Cast reason to string before passing it (031 comment) [extensions/Translate] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824442 (https://phabricator.wikimedia.org/T315657) (owner: 10Jforrester) [15:44:38] (03Abandoned) 10Jforrester: TranslatableBundleLogFormatter: Cast reason to string before passing it [extensions/Translate] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824442 (https://phabricator.wikimedia.org/T315657) (owner: 10Jforrester) [15:45:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:46:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:46:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:47:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:48:24] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:49:47] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P33614 and previous config saved to /var/cache/conftool/dbconfig/20220829-154946-ladsgroup.json [15:52:09] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Upgrade etherpad.wikimedia.org to (more) recent Etherpad version with more rich end-user features - https://phabricator.wikimedia.org/T316421 (10akosiaris) >>! In T316421#8194578, @JeanFred wrote: > Sounds to me that this task should be split up: > * renaming this o... [15:53:55] (03PS3) 10Clément Goubert: parsoid: Add parse1* servers to conftool [puppet] - 10https://gerrit.wikimedia.org/r/827513 (https://phabricator.wikimedia.org/T312638) [15:54:02] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:55:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] parsoid: Add parse1* servers to conftool [puppet] - 10https://gerrit.wikimedia.org/r/827513 (https://phabricator.wikimedia.org/T312638) (owner: 10Clément Goubert) [15:56:07] 10SRE, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team (Doing): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10dancy) [15:56:32] (03CR) 10Clément Goubert: [C: 03+2] parsoid: Add parse1* servers to conftool [puppet] - 10https://gerrit.wikimedia.org/r/827513 (https://phabricator.wikimedia.org/T312638) (owner: 10Clément Goubert) [16:00:54] 10SRE, 10ops-codfw, 10DBA: db2149 is sad after reboot - https://phabricator.wikimedia.org/T316494 (10Papaul) Create Dispatch: Success You have successfully submitted request SR150174485. 
[16:01:12] <_joe_> jouncebot: now [16:01:12] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [16:01:16] <_joe_> claime: ^^ [16:01:36] <_joe_> claime: that's the quickest way [16:02:00] !log cgoubert@puppetmaster1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1001.eqiad.wmnet [16:02:21] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=parsoid,name=parse1001.eqiad.wmnet [16:02:41] (03CR) 10Krinkle: [C: 03+2] Remove redundant $wgLanguageConverterCacheType CLI override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822114 (owner: 10Krinkle) [16:03:25] * Krinkle staging on mwdebug1002 [16:04:02] (03PS2) 10Krinkle: Remove redundant $wgLanguageConverterCacheType CLI override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822114 [16:04:07] (03CR) 10Krinkle: [C: 03+2] Remove redundant $wgLanguageConverterCacheType CLI override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822114 (owner: 10Krinkle) [16:04:31] (03PS2) 10Krinkle: Explicitly set wgMessageCacheType=mcrouter (avoid newAnything in prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822119 (https://phabricator.wikimedia.org/T186673) [16:04:45] (03CR) 10Majavah: [C: 03+2] Use shell webservice-runner for python35/python37 images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/827007 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm) [16:04:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P33615 and previous config saved to /var/cache/conftool/dbconfig/20220829-160452-ladsgroup.json [16:04:55] (03Merged) 10jenkins-bot: Remove redundant $wgLanguageConverterCacheType CLI override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822114 (owner: 10Krinkle) [16:05:29] (03Merged) 10jenkins-bot: Use shell webservice-runner for python35/python37 images 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/827007 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm) [16:05:30] ACKNOWLEDGEMENT - MegaRAID on db2149 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T316565 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:05:37] 10SRE, 10ops-codfw: Degraded RAID on db2149 - https://phabricator.wikimedia.org/T316565 (10ops-monitoring-bot) [16:05:45] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1033.eqiad.wmnet with OS buster [16:05:45] (03CR) 10Majavah: "Do we want to set the PORT env variable here?" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/827009 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm) [16:08:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:08:10] !log pooled parse1001.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638 [16:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:11] (03PS1) 10Snwachukwu: Update Puppet files for Airflow Upgrade to 2.3.2 [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) [16:11:00] (03CR) 10CI reject: [V: 04-1] Update Puppet files for Airflow Upgrade to 2.3.2 [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [16:11:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:11:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:12:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:12:49] !log depooled wtp1034.eqiad.wmnet from parsoid cluster https://phabricator.wikimedia.org/T312638 [16:12:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:05] (03PS2) 10Snwachukwu: Update Puppet files for Airflow Upgrade to 2.3.2 [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) [16:14:41] (03CR) 10jenkins-bot: Update Puppet files for Airflow Upgrade to 2.3.2 [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [16:16:01] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:40] (03PS1) 10Giuseppe Lavagetto: Update wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827528 [16:17:59] <_joe_> jouncebot: next [16:17:59] In 0 hour(s) and 42 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220829T1700) [16:18:09] <_joe_> ok, I can just deploy it freely [16:19:03] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827528 (owner: 10Giuseppe Lavagetto) [16:19:51] RECOVERY - mediawiki-installation DSH group on parse1001 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:19:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Update wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827528 (owner: 10Giuseppe Lavagetto) [16:19:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T316186)', diff saved to https://phabricator.wikimedia.org/P33616 and previous config saved to /var/cache/conftool/dbconfig/20220829-161959-ladsgroup.json [16:20:41] (03Merged) 10jenkins-bot: Update wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827528 (owner: 10Giuseppe Lavagetto) [16:22:23] <_joe_> Krinkle: are you deploying something right now? 
[16:22:34] <_joe_> oh-uhm [16:23:28] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Rotate logs as the www-data user and librenms group. [puppet] - 10https://gerrit.wikimedia.org/r/827450 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [16:23:34] <_joe_> claime: well I guess we need to depool parse1001 then. [16:23:52] <_joe_> we can't deploy our change right now, because a lock has been taken [16:24:03] I see [16:24:44] !log repooled wtp1034.eqiad.wmnet and depooled parse1001.eqiad.wmnet [16:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:13] (03CR) 10Dzahn: "thanks! so we are now running it daily - https://gerrit.wikimedia.org/r/825424" [puppet] - 10https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: 10Chad) [16:25:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33617 and previous config saved to /var/cache/conftool/dbconfig/20220829-162516-ladsgroup.json [16:25:44] jouncebot: now [16:25:45] No deployments scheduled for the next 0 hour(s) and 34 minute(s) [16:27:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:28:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:28:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:29:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:29:42] (03PS1) 10Zabe: Add missing comma [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827199 (https://phabricator.wikimedia.org/T316150) [16:30:22] _joe_: ack, I'll unlock [16:30:28] zabe: need a deployer? 
[16:30:41] _joe_: am I okay to sync-file now, or want me to undo? [16:31:07] Lucas_WMDE, yeah [16:31:43] but I think _jo.e_ is currently deploying [16:31:58] ack [16:33:00] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM" [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827199 (https://phabricator.wikimedia.org/T316150) (owner: 10Zabe) [16:34:09] !log krinkle@deploy1002 sync-file aborted: (no justification provided) (duration: 00m 01s) [16:35:15] _joe_: ok, apparently I'm rsyncing your change now. I didn't realize it was git-pulled already [16:35:26] I don't know if it also ended up covered by my mwdebug stage [16:37:00] 5000 errors in the logs in the past hour for: [{reqId}] {exception_url} PHP Warning: Erroneous data format for unserializing 'Wikimedia\Rdbms\MySQLPrimaryPos' [16:37:45] how is that not triggering an alert? [16:37:54] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:38:15] !log krinkle@deploy1002 Synchronized wmf-config/: I23c22105bb0062116 (duration: 03m 57s) [16:39:06] started 08:30 UTC today [16:39:08] 10SRE, 10ops-codfw, 10DBA: db2149 is sad after reboot - https://phabricator.wikimedia.org/T316494 (10Ladsgroup) Thanks @Papaul [16:39:19] https://logstash.wikimedia.org/goto/659487e340575d346090e4b26cb658ba [16:39:56] Amir1: ^ [16:39:56] RECOVERY - Check systemd state on cp6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:19] let me see [16:40:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to 
https://phabricator.wikimedia.org/P33618 and previous config saved to /var/cache/conftool/dbconfig/20220829-164022-ladsgroup.json [16:40:49] claime: The change you and _joe_ prepped is now live. I noticed the git-log half-way through the scap sync command in a second tab (it wasn't there when I checked it before starting the sync). [16:41:07] Krinkle: isn't it php 7.4? [16:41:08] I can revert it if you like. Let me know. I'm not seeing any obvious errors in the logs. [16:41:25] (03CR) 10Bearloga: r_lang: Switch from devtools to remotes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817907 (owner: 10Bearloga) [16:41:26] all 16,000 errors are phpversion: 7.2 [16:41:49] (03CR) 10Bearloga: shiny_server: Minimal dependencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817903 (owner: 10Bearloga) [16:42:16] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:42:20] that's weird. Time-wise matches that but yeah, maybe a mismatch between php 7.2 and 7.4 [16:43:12] ack, php74 storing data that php72 gets confused by? [16:46:09] * Krinkle checks 3v4l [16:46:24] I’m heading out for today – once the current issues are done, Someone™ should deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/827199 [16:48:17] Krinkle: yeah, that's my guess, I haven't deployed anything else around that time [16:49:05] (I deployed stop writing to templatelinks in commons which in no way can cause issues like this) [16:49:24] https://3v4l.org/fJY27 [16:49:47] there's definitely a serialization difference between php7.1-7.3 and 7.4-8.1 [16:50:40] I don't know if that's universal or not. 
I imagine there's not many things we serialize as PHP into memcached and that are popular enough to stand out in logs like this [16:50:51] could be something specific about this class. It does have custom serialization logic [16:51:00] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:51:58] (03PS1) 10Jdlrobson: Revert "Enable new Vector skin on select pages (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827200 (https://phabricator.wikimedia.org/T309973) [16:52:07] (03PS2) 10Jdlrobson: Revert "Enable new Vector skin on select pages (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827200 (https://phabricator.wikimedia.org/T309973) [16:53:00] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:54:42] (03PS1) 10Krinkle: Revert "Update wgLinterSubmitterWhitelist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827201 [16:55:06] _joe_: claime: I need to leave. erring on side of caution if not ack'ed. I'll revert? 
[16:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P33619 and previous config saved to /var/cache/conftool/dbconfig/20220829-165529-ladsgroup.json [16:55:40] (03CR) 10Krinkle: [C: 03+2] Revert "Update wgLinterSubmitterWhitelist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827201 (owner: 10Krinkle) [16:56:00] Krinkle: no worries it's just a whitelist of a server we depooled when we saw the change couldn't be deployed [16:56:11] There are a bunch of ongoing issues that probably need to be dealt with first, including the possibly-php74-induced incident that's affecting MySQLPos objects and causing 16,000 errors today so far. [16:56:29] claime: ok, so safe to keep? [16:56:30] I can't follow up though [16:56:50] Not enough information to confirm Krinkle, it's one of my first changes there [16:57:48] the likely cause is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/823675 [16:58:15] I guess we would need to know: Does it control what's used or what's accepted? If what's used: are the added servers ready for use? If what's accepted: Are the removed entries no longer pooled? I don't know the Linter extension well enough to decide in negative 180s ETA what to do. Will revert [16:58:17] Sorry for the mess.
[16:58:24] (03CR) 10Krinkle: [V: 03+2 C: 03+2] Revert "Update wgLinterSubmitterWhitelist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827201 (owner: 10Krinkle) [16:58:33] There are no removed entries, it's whitespace [16:59:43] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/204a546159e407b51550ded2a43215907d7faae8%5E%21/#F0 [16:59:54] RECOVERY - Check systemd state on dse-k8s-worker1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:58] I prefer fixing MySQLPos to be compatible across PHP versions. If it's relying on load monitor, we can disable load monitor during the transition period; in its current shape it's completely useless [17:00:01] I don't see a pair of rm/add lines. It's adding and removing actual entries to the best of my knowledge. [17:00:04] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220829T1700). [17:00:42] Krinkle: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/827528 [17:01:03] Oh right it's removing a bunch of entries actually yeah [17:01:40] claime: sorry again for the mess. two unplanned casual deploys conflicting. Belated welcome :) [17:01:48] let me check [17:01:55] Krinkle: thanks [17:02:08] I gotta run now. I'm waiting for the revert to finish. Will be a good exercise to re-do even if uncontroversial. [17:02:33] !log krinkle@deploy1002 Synchronized wmf-config/: I1f79f21cbf8 (duration: 03m 42s) [17:02:54] Amir1: agreed.
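One way to read the "compatible across PHP versions" preference voiced at 16:59:58 is to cache plain, versioned data instead of a runtime-serialized object. The following is only a rough sketch of that idea in Python, not the PHP that MediaWiki uses and not the actual MySQLPrimaryPos code; the class ReplicationPos, the helpers to_cache_payload and from_cache_payload, and every field name are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional

FORMAT_VERSION = 1  # hypothetical schema version for the cached payload


@dataclass(frozen=True)
class ReplicationPos:
    """Hypothetical stand-in for a replication position object."""
    binlog_file: str
    binlog_offset: int
    as_of_time: float


def to_cache_payload(pos: ReplicationPos) -> str:
    """Encode the position as plain JSON rather than a runtime-specific object dump."""
    return json.dumps(
        {
            "v": FORMAT_VERSION,
            "file": pos.binlog_file,
            "offset": pos.binlog_offset,
            "asOfTime": pos.as_of_time,
        },
        sort_keys=True,
    )


def from_cache_payload(payload: str) -> Optional[ReplicationPos]:
    """Decode a payload; return None (treat as a cache miss) if it is unreadable."""
    try:
        data = json.loads(payload)
        if not isinstance(data, dict) or data.get("v") != FORMAT_VERSION:
            return None
        return ReplicationPos(
            str(data["file"]), int(data["offset"]), float(data["asOfTime"])
        )
    except (ValueError, KeyError, TypeError):
        # Malformed or foreign-format data is ignored instead of raising a warning.
        return None
```

Returning None for an unreadable payload makes a format mismatch look like a cache miss, so readers on a different runtime version would skip the value rather than log an unserialize warning.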
[17:03:00] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on restbase[1031-1033].eqiad.wmnet with reason: New hosts - awaiting cassandra joins [17:03:04] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on restbase[1031-1033].eqiad.wmnet with reason: New hosts - awaiting cassandra joins [17:03:17] Krinkle: Yeah, I'm gonna have to go too, although I can tell you that the two first servers removed I checked are also removed from conftool [17:03:37] I think joe took the opportunity for a cleanup [17:03:40] Amir1: can you assess impact of the current serialization warning? I "hope" it all degrades to e.g. waitFor and ChronologyProtector no-oping, which is fine-ish for CP, not sure how fine it is for waitFor as that could cause bad side-effects. [17:04:12] inc increased load during jobs and maint scripts [17:04:29] Krinkle: parse1001 is the "real" change, whitelisting the new parsoid server in 7.4 [17:04:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:04:47] I haven't seen any users complaining, my scripts are running fine but let me check mysql graphs [17:04:54] but yeah, we should try to e.g. 
extract into a string or array and store that in memc instead of as PHP object [17:05:03] * Krinkle is leaving now [17:05:25] * claime same, especially since I can't help with debugging this at all yet [17:05:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:05:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:05:55] I'm not seeing any impact on scripts [17:06:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:06:48] (03CR) 10Cwhite: [C: 03+2] logstash: alerts to use yearly rotation [puppet] - 10https://gerrit.wikimedia.org/r/826385 (https://phabricator.wikimedia.org/T304924) (owner: 10Cwhite) [17:06:54] PROBLEM - Check systemd state on dse-k8s-worker1005 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:24] (03PS1) 10Bernard Wang: Fix site notice spacing [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827534 (https://phabricator.wikimedia.org/T315595) [17:08:25] (03PS3) 10Clare Ming: Revert "Enable new Vector skin on select pages (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827200 (https://phabricator.wikimedia.org/T309973) (owner: 10Jdlrobson) [17:08:33] (03CR) 10Clare Ming: [C: 03+1] Revert "Enable new Vector skin on select pages (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827200 (https://phabricator.wikimedia.org/T309973) (owner: 10Jdlrobson) [17:09:24] jouncebot: nowandnext [17:09:24] For the next 0 hour(s) and 20 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220829T1700) [17:09:25] In 2 hour(s) and 50 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220829T2000) [17:09:30] (03CR) 10Urbanecm: [C: 03+2] "UBN" [extensions/SecurePoll] 
(wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827199 (https://phabricator.wikimedia.org/T316150) (owner: 10Zabe) [17:09:36] (03PS1) 10Bking: deployment-prep: change ES version from 6 to 7 [puppet] - 10https://gerrit.wikimedia.org/r/827535 (https://phabricator.wikimedia.org/T316240) [17:10:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33620 and previous config saved to /var/cache/conftool/dbconfig/20220829-171035-ladsgroup.json [17:10:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [17:10:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [17:10:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:11:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:11:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T316186)', diff saved to https://phabricator.wikimedia.org/P33621 and previous config saved to /var/cache/conftool/dbconfig/20220829-171116-ladsgroup.json [17:11:24] thanks urbanecm [17:12:20] (03Merged) 10jenkins-bot: Add missing comma [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827199 (https://phabricator.wikimedia.org/T316150) (owner: 10Zabe) [17:15:30] <_joe_> Krinkle: did you revert? 
[17:16:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:16:50] _joe_: Timo is already out [17:16:53] I'm looking [17:17:14] (for the MySQLPos issue, the other one got reverted) [17:17:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:17:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:18:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T316186)', diff saved to https://phabricator.wikimedia.org/P33622 and previous config saved to /var/cache/conftool/dbconfig/20220829-171839-ladsgroup.json [17:18:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:18:57] <_joe_> no need to revert [17:21:51] * urbanecm is deploying an UBN fix [17:22:02] (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/827199) [17:25:26] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/SecurePoll/includes/Pages/VoterEligibilityPage.php: 2d6c378fe509551607c382f96adf1c4fa4c4bad2: Add missing comma (T316150) (duration: 03m 47s) [17:25:31] * urbanecm is done [17:25:32] T316150: Adding a global account to override list gives DBTransactionSizeError - https://phabricator.wikimedia.org/T316150 [17:31:54] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10dcausse) @bking thanks for running the cleanup! I can confirm that the `wikidata` and `commons` pseudo-fol... 
[17:33:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P33623 and previous config saved to /var/cache/conftool/dbconfig/20220829-173345-ladsgroup.json [17:38:29] it seems it's affecting CP only [17:38:36] (chronology protector) [17:43:13] (03PS1) 10Bartosz Dziewoński: Fix boilerplate in maintenance scripts for WMF production [extensions/DiscussionTools] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827202 (https://phabricator.wikimedia.org/T316548) [17:45:38] (03CR) 10Ebernhardson: "I'm not clear on how the ssl key is related to the 6->7 upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/827535 (https://phabricator.wikimedia.org/T316240) (owner: 10Bking) [17:48:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P33624 and previous config saved to /var/cache/conftool/dbconfig/20220829-174851-ladsgroup.json [17:52:27] (03PS1) 10Gergő Tisza: Restore auth request ID from before namespacing [extensions/ConfirmEdit] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827203 (https://phabricator.wikimedia.org/T316410) [18:00:53] (03Restored) 10Gergő Tisza: Fix WelcomeSurvey CentralAuthPostLoginRedirect hook [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827191 (https://phabricator.wikimedia.org/T315583) (owner: 10Kosta Harlan) [18:01:17] (03PS2) 10Gergő Tisza: Fix WelcomeSurvey CentralAuthPostLoginRedirect hook [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827191 (https://phabricator.wikimedia.org/T315583) (owner: 10Kosta Harlan) [18:03:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T316186)', diff saved to https://phabricator.wikimedia.org/P33625 and previous config saved to /var/cache/conftool/dbconfig/20220829-180358-ladsgroup.json [18:04:02] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:04:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:04:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T316186)', diff saved to https://phabricator.wikimedia.org/P33626 and previous config saved to /var/cache/conftool/dbconfig/20220829-180421-ladsgroup.json [18:11:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T316186)', diff saved to https://phabricator.wikimedia.org/P33627 and previous config saved to /var/cache/conftool/dbconfig/20220829-181140-ladsgroup.json [18:12:08] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Routing loop for unused WMCS IPs in 185.15.56.0/24 - https://phabricator.wikimedia.org/T315956 (10cmooney) I've made these changes now manually. One change to note is the included ASNs of eBGP peers cloudsw1-e4 and cloudsw1-f4 on the aggrega... 
[18:14:06] (03CR) 10Dzahn: [C: 03+2] doc: properly redirect back compat URLs [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [18:22:14] (03PS1) 10Cathal Mooney: Include routes from proto aggregate in Cloud VRF out filter to CRs [homer/public] - 10https://gerrit.wikimedia.org/r/827542 (https://phabricator.wikimedia.org/T315956) [18:24:40] (03PS1) 10Jgreen: Add frdb1006 to icinga/nsca monitoring [puppet] - 10https://gerrit.wikimedia.org/r/827543 (https://phabricator.wikimedia.org/T312584) [18:26:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P33628 and previous config saved to /var/cache/conftool/dbconfig/20220829-182646-ladsgroup.json [18:28:28] (03PS2) 10Bking: deployment-prep: change ES version from 6 to 7 [puppet] - 10https://gerrit.wikimedia.org/r/827535 (https://phabricator.wikimedia.org/T316240) [18:29:46] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:29:55] (03CR) 10Bking: deployment-prep: change ES version from 6 to 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/827535 (https://phabricator.wikimedia.org/T316240) (owner: 10Bking) [18:30:13] (03CR) 10Cathal Mooney: [C: 03+2] Include routes from proto aggregate in Cloud VRF out filter to CRs [homer/public] - 10https://gerrit.wikimedia.org/r/827542 (https://phabricator.wikimedia.org/T315956) (owner: 10Cathal Mooney) [18:30:54] (03Merged) 10jenkins-bot: Include routes from proto aggregate in Cloud VRF out filter to CRs [homer/public] - 10https://gerrit.wikimedia.org/r/827542 (https://phabricator.wikimedia.org/T315956) (owner: 10Cathal Mooney) [18:31:14] (03CR) 10Jgreen: [C: 03+2] Add frdb1006 to icinga/nsca monitoring [puppet] - 10https://gerrit.wikimedia.org/r/827543 
(https://phabricator.wikimedia.org/T312584) (owner: 10Jgreen) [18:32:19] (03CR) 10Dzahn: [C: 03+2] "I deployed the test change first. Test results before making apache change:" [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [18:32:32] (03CR) 10Ebernhardson: [C: 03+1] deployment-prep: change ES version from 6 to 7 [puppet] - 10https://gerrit.wikimedia.org/r/827535 (https://phabricator.wikimedia.org/T316240) (owner: 10Bking) [18:33:54] (03CR) 10Dzahn: [C: 03+2] "I deployed the test change first. The result _before_ making the Apache change passes though, which should not be the base if current conf" [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [18:35:34] (03CR) 10Bking: [C: 03+2] deployment-prep: change ES version from 6 to 7 [puppet] - 10https://gerrit.wikimedia.org/r/827535 (https://phabricator.wikimedia.org/T316240) (owner: 10Bking) [18:38:13] (03CR) 10Dzahn: [C: 03+2] "ah, no, my bad. all is good! I just did not deploy like I meant to. 
On doc1002 it was already applied and tests pass and on doc2001 I coul" [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [18:41:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P33629 and previous config saved to /var/cache/conftool/dbconfig/20220829-184153-ladsgroup.json [18:52:13] jouncebot: nowandnext [18:52:13] No deployments scheduled for the next 1 hour(s) and 7 minute(s) [18:52:13] In 1 hour(s) and 7 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220829T2000) [18:54:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:56:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T316186)', diff saved to https://phabricator.wikimedia.org/P33630 and previous config saved to /var/cache/conftool/dbconfig/20220829-185659-ladsgroup.json [18:57:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [18:57:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [18:57:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T316186)', diff saved to https://phabricator.wikimedia.org/P33631 and previous config saved to /var/cache/conftool/dbconfig/20220829-185723-ladsgroup.json [18:57:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:57:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:58:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:03:19] (03PS3) 10Dzahn: admin: add reserved gid 920 for phd, phabricator user [puppet] - 10https://gerrit.wikimedia.org/r/826915 
(https://phabricator.wikimedia.org/T313360) [19:03:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:04:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:04:08] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! I believe should address the issue I mentioned in the task." [dns] - 10https://gerrit.wikimedia.org/r/827446 (owner: 10Majavah) [19:04:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:04:11] (03CR) 10CI reject: [V: 04-1] admin: add reserved gid 920 for phd, phabricator user [puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [19:04:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T316186)', diff saved to https://phabricator.wikimedia.org/P33632 and previous config saved to /var/cache/conftool/dbconfig/20220829-190444-ladsgroup.json [19:04:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:05:17] (03CR) 10Dzahn: "broken rebase - let me see if I can merge Denisse's change first" [puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [19:05:40] (03PS1) 10Stang: logos: Raise error if logo is too tall [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827546 (https://phabricator.wikimedia.org/T310961) [19:06:34] (03CR) 10Dzahn: "maybe the task could be linked here. A little context what problems this caused would be interested to read." [dns] - 10https://gerrit.wikimedia.org/r/827446 (owner: 10Majavah) [19:08:54] (03CR) 10Dzahn: [C: 03+2] "being bold and merging this. 
already had a +1 in the past and I have another change rebased on top" [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [19:10:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:10:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:10:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:11:05] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse) [19:11:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:13:15] (03CR) 10Dzahn: [C: 03+2] "on netmon1002, netmon1003, netmon2002: confirmed noop and uid/gid is already 921:921 on all of them as well" [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [19:17:08] (03PS4) 10Dzahn: admin: add reserved gid 920 for phd, phabricator user [puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360) [19:18:43] (03PS5) 10Dzahn: admin: add reserved gid 920 for phd, phabricator user [puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360) [19:19:50] (03CR) 10Dzahn: [C: 03+2] "added a reason here as well, per the comment in librenms change" [puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [19:19:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P33633 and previous config saved to /var/cache/conftool/dbconfig/20220829-191950-ladsgroup.json [19:24:43] (03PS2) 10Bartosz Dziewoński: Enable reply tool by default on 
fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748765 (https://phabricator.wikimedia.org/T297533) (owner: 10Esanders) [19:24:45] (03PS1) 10Bartosz Dziewoński: Make DiscussionTools topicsubscription, autotopicsub opt-out on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827548 (https://phabricator.wikimedia.org/T315714) [19:26:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P33634 and previous config saved to /var/cache/conftool/dbconfig/20220829-192608-ladsgroup.json [19:26:18] (03CR) 10Dzahn: "Ideal would be if the perf-team can work directly with observability to get this tested and deployed. It was meant to be a pointer how to " [puppet] - 10https://gerrit.wikimedia.org/r/823737 (https://phabricator.wikimedia.org/T277927) (owner: 10Dzahn) [19:27:05] (03CR) 10Dzahn: [C: 03+2] wikistats: run updates of WMF-operated wikis earlier in the day [puppet] - 10https://gerrit.wikimedia.org/r/826394 (https://phabricator.wikimedia.org/T315121) (owner: 10Dzahn) [19:27:10] (03PS2) 10Dzahn: wikistats: run updates of WMF-operated wikis earlier in the day [puppet] - 10https://gerrit.wikimedia.org/r/826394 (https://phabricator.wikimedia.org/T315121) [19:28:52] (03PS3) 10Gergő Tisza: Fix WelcomeSurvey CentralAuthPostLoginRedirect hook (step 1) [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827191 (https://phabricator.wikimedia.org/T315583) (owner: 10Kosta Harlan) [19:28:54] (03PS1) 10Gergő Tisza: Fix WelcomeSurvey CentralAuthPostLoginRedirect hook (step 2) [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827549 (https://phabricator.wikimedia.org/T315583) [19:33:25] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@5c0af35]: Update to work with elasticsearch 7.x [19:34:19] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@5c0af35]: Update to work with 
elasticsearch 7.x (duration: 00m 54s) [19:44:15] (03PS5) 10Gergő Tisza: Declare mediawiki.accountcreation_block stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [19:52:10] (03CR) 10Kosta Harlan: [C: 03+1] Declare mediawiki.accountcreation_block stream (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [19:54:29] (03PS1) 10Andrea Denisse: netmon: Add execution permission for others in the rancid directory. [puppet] - 10https://gerrit.wikimedia.org/r/827553 (https://phabricator.wikimedia.org/T316569) [20:00:00] (03PS2) 10Andrea Denisse: netmon: Add the wikidev group for the rancid directory. [puppet] - 10https://gerrit.wikimedia.org/r/827553 (https://phabricator.wikimedia.org/T316569) [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220829T2000) [20:00:05] cjming, MatmaRex, tgr, and koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:31] o/ hi - i can deploy [20:00:32] o/ [20:00:47] o/ [20:00:51] * TheresNoTime around [20:00:59] i'll go in order and start with my patches [20:00:59] koi: sorry for skipping the line, I added some patches that were relatively urgent [20:01:00] hi [20:01:15] (03CR) 10Clare Ming: [C: 03+2] Revert "Enable new Vector skin on select pages (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827200 (https://phabricator.wikimedia.org/T309973) (owner: 10Jdlrobson) [20:01:30] tgr: it's ok, I could wait for next window, mine is not that urgent [20:01:35] (03CR) 10Clare Ming: [C: 03+2] Fix site notice spacing [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827534 (https://phabricator.wikimedia.org/T315595) (owner: 10Bernard Wang) [20:02:02] koi: let's see where we end up - happy to do your patches if the window doesn't go too late [20:02:04] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1003/37033/" [puppet] - 10https://gerrit.wikimedia.org/r/827553 (https://phabricator.wikimedia.org/T316569) (owner: 10Andrea Denisse) [20:02:16] got it, thanks! 
[20:02:37] (03CR) 10Clare Ming: [C: 03+2] Fix boilerplate in maintenance scripts for WMF production [extensions/DiscussionTools] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827202 (https://phabricator.wikimedia.org/T316548) (owner: 10Bartosz Dziewoński) [20:02:39] (03Merged) 10jenkins-bot: Revert "Enable new Vector skin on select pages (take 2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827200 (https://phabricator.wikimedia.org/T309973) (owner: 10Jdlrobson) [20:02:48] (03CR) 10Clare Ming: [C: 03+2] Restore auth request ID from before namespacing [extensions/ConfirmEdit] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827203 (https://phabricator.wikimedia.org/T316410) (owner: 10Gergő Tisza) [20:03:07] (03CR) 10Clare Ming: [C: 03+2] Fix WelcomeSurvey CentralAuthPostLoginRedirect hook (step 1) [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827191 (https://phabricator.wikimedia.org/T315583) (owner: 10Kosta Harlan) [20:04:46] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@57fb704]: Deploy mjolnir 1.1 for elasticsearch 7.x compatability [20:04:58] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@57fb704]: Deploy mjolnir 1.1 for elasticsearch 7.x compatability (duration: 00m 11s) [20:05:09] (03PS1) 10BBlack: Revert "trafficserver: Hide non session cookies during cache lookup" [puppet] - 10https://gerrit.wikimedia.org/r/827566 [20:05:17] (03PS2) 10BBlack: Revert "trafficserver: Hide non session cookies during cache lookup" [puppet] - 10https://gerrit.wikimedia.org/r/827566 [20:07:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:40] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "trafficserver: Hide non session cookies during cache lookup" [puppet] - 10https://gerrit.wikimedia.org/r/827566 (owner: 10BBlack) [20:08:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply 
[20:08:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:08:57] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:827200|Revert "Enable new Vector skin on select pages (take 2)" (T309973)]] (duration: 03m 34s) [20:09:03] T309973: [Goal] Provide a sneak preview of the new experience and run survey using a banner linking to a page displayed in vector 2022 - https://phabricator.wikimedia.org/T309973 [20:12:48] RECOVERY - MegaRAID on db2110 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:13:50] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@57fb704]: Deploy mjolnir 1.1 for elasticsearch 7.x compatability [20:14:14] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@57fb704]: Deploy mjolnir 1.1 for elasticsearch 7.x compatability (duration: 00m 24s) [20:14:51] !log Revert of cookie-related changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/827566/ pushing to all cp-text [20:14:51] (03Merged) 10jenkins-bot: Fix boilerplate in maintenance scripts for WMF production [extensions/DiscussionTools] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827202 (https://phabricator.wikimedia.org/T316548) (owner: 10Bartosz Dziewoński) [20:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:19:08] does anyone know if flaky selenium tests prevent merging? 
i feel like they shouldn't but maybe they do - we'll find out soon enough [20:20:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:20:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:20:45] (03CR) 10CI reject: [V: 04-1] Fix site notice spacing [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827534 (https://phabricator.wikimedia.org/T315595) (owner: 10Bernard Wang) [20:21:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:22:19] 10SRE, 10Traffic: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 (10bd808) >>! In T316338#8193037, @gerritbot wrote: > Change 826785 **merged** by Vgutierrez: > %%%[operations/puppet@production] trafficserver: Hide non session cookies during cache lookup%%% > h... [20:22:27] MatmaRex: gonna go ahead with your patches since the vector one i'm trying to do is giving grief [20:22:50] D: thanks [20:24:28] MatmaRex: 827202 on mwdebug1002 if you can check [20:25:07] (03PS1) 10Herron: WIP: victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) [20:25:11] cjming: not really, it's just a maintenance script [20:25:15] depends on the repo, some run selenium test on every patch revision, some before merge, some post-merge. Also some are voting and some not. [20:25:31] the expected effect of the change is that persistRevisionThreadItems.php will run now [20:25:51] MatmaRex: should I sync now then and run main script? [20:25:53] You can always force-merge if the test is broken on the wmf branch. 
[20:26:10] (03CR) 10CI reject: [V: 04-1] Restore auth request ID from before namespacing [extensions/ConfirmEdit] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827203 (https://phabricator.wikimedia.org/T316410) (owner: 10Gergő Tisza) [20:26:25] (03CR) 10CI reject: [V: 04-1] WIP: victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) (owner: 10Herron) [20:26:28] tgr: thanks - just click on Verified+2 in gerrit? [20:26:29] cjming: yes please. (please run the maint script with `time` and save the log output somewhere) [20:27:06] click Verified+2, delete any Verified-1 votes (if present), and click Submit [20:27:06] !log cjming@deploy1002 sync-file aborted: Backport: [[gerrit:827202|Fix boilerplate in maintenance scripts for WMF production (T316548)]] (duration: 00m 05s) [20:27:11] T316548: DiscussionTools' maintenance scripts cannot be executed in Wikimedia production - https://phabricator.wikimedia.org/T316548 [20:27:13] yeah, remove the V-1 from the test, add V+2, click submit [20:27:37] (03CR) 10Clare Ming: [V: 03+2 C: 03+2] Fix site notice spacing [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827534 (https://phabricator.wikimedia.org/T315595) (owner: 10Bernard Wang) [20:29:41] filed T316596 about the broken test [20:29:42] T316596: GrowthExperiments change tags test broken on wmf.26 branch - https://phabricator.wikimedia.org/T316596 [20:30:00] we should probably just disable it for the time being [20:31:01] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/DiscussionTools/maintenance: Backport: [[gerrit:827202|Fix boilerplate in maintenance scripts for WMF production (T316548)]] (duration: 03m 41s) [20:31:09] MatmaRex: it's been a while -- on mwmaint1002, i should cd into /srv/mediawiki/php-1.39.0-wmf.26 and run 'mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --wiki test' ? 
[20:31:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:31:29] --wiki testwiki [20:32:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:32:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:32:24] hmm [20:32:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:33:25] cjming: i think `--wiki testwiki` needs to be the first parameter [20:33:38] https://wikitech.wikimedia.org/wiki/Maintenance_server#Run_a_maintenance_script_on_a_wiki [20:34:15] otherwise i think that's correct, or at least it won't break anything (i haven) [20:34:24] (…'t actually ever done it myself) [20:34:32] MatmaRex: sounds good - i'll give it a whirl [20:35:47] (03PS1) 10Gergő Tisza: Temporarily disable change tag test [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827564 (https://phabricator.wikimedia.org/T316596) [20:36:01] cjming: ^ [20:37:10] tgr: thanks - i'll do that one next [20:38:11] syncing my vector patch now, running Bartosz's maint script, and moving onto Gergo's patches here in a minute [20:38:27] (03CR) 10CI reject: [V: 04-1] Fix WelcomeSurvey CentralAuthPostLoginRedirect hook (step 1) [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827191 (https://phabricator.wikimedia.org/T315583) (owner: 10Kosta Harlan) [20:38:44] the script might take a while [20:38:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Jclark-ctr) kubernetes1023 c6 u42 port 36 cableid 23000039 kubernetes1024 d8 u25 port 40 cableid 101760 [20:38:57] probably a few minutes, maybe a few hours [20:39:07] (so start it in a `screen` or something please) [20:39:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - 
https://phabricator.wikimedia.org/T313873 (10Jclark-ctr) [20:39:18] MatmaRex: ya -- it's at about 0.62% done atm [20:39:21] it's safe to stop if needed [20:39:27] ok great :D thanks [20:39:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [20:42:08] tgr: gonna force merge your patches for now since the temp disabling one will take a bit to pass CI [20:42:16] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:42:29] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.26/skins/Vector: Backport: [[gerrit:827534|Fix site notice spacing (T315595)]] (duration: 03m 46s) [20:42:34] T315595: [Regression]: Remove extra spacing between header and page title (from empty site notice) - https://phabricator.wikimedia.org/T315595 [20:42:44] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [20:44:09] (03CR) 10Clare Ming: [V: 03+2 C: 03+2] Restore auth request ID from before namespacing [extensions/ConfirmEdit] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827203 (https://phabricator.wikimedia.org/T316410) (owner: 10Gergő Tisza) [20:45:56] (03CR) 10Dzahn: [C: 04-1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/826394 was done" [puppet] - 10https://gerrit.wikimedia.org/r/826347 (https://phabricator.wikimedia.org/T315121) (owner: 10Dzahn) [20:46:29] tgr: 827203 on mwdebug1002 if it's testable [20:47:34] (03CR) 10Dzahn: [C: 03+1] "well, at least for Wikipedias it would still be an improvement" [puppet] - 10https://gerrit.wikimedia.org/r/826347 (https://phabricator.wikimedia.org/T315121) 
(owner: 10Dzahn) [20:47:38] (03CR) 10Dzahn: [C: 03+2] mediwiki/initsitestats: change time of day to run initsitestats [puppet] - 10https://gerrit.wikimedia.org/r/826347 (https://phabricator.wikimedia.org/T315121) (owner: 10Dzahn) [20:48:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:48:46] 10SRE, 10ops-eqiad, 10DC-Ops: dbprov1002 lost power redundancy - https://phabricator.wikimedia.org/T315439 (10Jclark-ctr) 05Open→03Resolved a:05Cmjohnson→03Jclark-ctr noticed server still had fault pulled psu and reseated cable cleared fault [20:48:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:48:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:49:14] cjming: works, thanks [20:49:23] cool - syncing now [20:49:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:49:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install centrallog1002 - https://phabricator.wikimedia.org/T313858 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [20:49:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [20:50:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install graphite1005 - https://phabricator.wikimedia.org/T313853 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [20:51:10] (03CR) 10Clare Ming: [V: 03+2 C: 03+2] Fix WelcomeSurvey CentralAuthPostLoginRedirect hook (step 1) [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827191 (https://phabricator.wikimedia.org/T315583) (owner: 10Kosta Harlan) [20:53:25] tgr: 827191 up on mwdebug1002 if you can verify [20:53:38] !log cjming@deploy1002 Synchronized 
php-1.39.0-wmf.26/extensions/ConfirmEdit/includes/Auth/CaptchaAuthenticationRequest.php: Backport: [[gerrit:827203|Restore auth request ID from before namespacing (T316410)]] (duration: 03m 45s) [20:53:43] T316410: Account creation (captcha handling) not working - https://phabricator.wikimedia.org/T316410 [20:54:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:54:40] MatmaRex: script is at ~8% [20:55:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:55:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:55:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:56:09] cjming: thanks. i guess we can check on it tomorrow [20:56:55] cjming: that half is a no-op, the patch has been split to avoid race. Verified that it doesn't cause errors. [20:57:14] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [20:57:14] ...avoid race conditions. [20:57:16] MatmaRex: it's ok to just let it run? i forgot to ask here beforehand if it's ok to run [20:57:23] tgr: sounds good - syncing then [20:57:58] cjming: yeah [20:58:42] MatmaRex: ok then - i'll send you output once it's finished [20:58:51] thanks [21:00:05] Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220829T2100). 
[21:01:32] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/GrowthExperiments/includes/WelcomeSurveyHooks.php: Backport: [[gerrit:827191|Fix WelcomeSurvey CentralAuthPostLoginRedirect hook (step 1) (T315583 T316311)]] (duration: 03m 36s) [21:01:39] T316311: GrowthExperiments\WelcomeSurveyHooks->onCentralAuthPostLoginRedirect does write on GET - https://phabricator.wikimedia.org/T316311 [21:01:39] T315583: PHP Warning: Invalid argument supplied for foreach() at GlobalFunctions (from GrowthExperiments WelcomeSurveyHooks.php) - https://phabricator.wikimedia.org/T315583 [21:01:43] (03CR) 10Clare Ming: [C: 03+2] Temporarily disable change tag test [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827564 (https://phabricator.wikimedia.org/T316596) (owner: 10Gergő Tisza) [21:03:47] (03CR) 10Clare Ming: [V: 03+2 C: 03+2] Temporarily disable change tag test [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827564 (https://phabricator.wikimedia.org/T316596) (owner: 10Gergő Tisza) [21:04:22] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:06:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:06:22] unless someone else needs the window, i'll continue with koi's patches [21:06:41] I'm here [21:06:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:06:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:06:51] tgr: your patches should be live - just syncing 827564 now [21:07:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:08:28] thanks cjming! 
are you also deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/827549 ? [21:08:42] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/GrowthExperiments/tests/selenium/specs/homepage.js: Backport: [[gerrit:827564|Temporarily disable change tag test (T316596)]] (duration: 03m 49s) [21:08:47] T316596: GrowthExperiments change tags test broken on wmf.26 branch - https://phabricator.wikimedia.org/T316596 [21:08:49] I can do it if you prefer [21:09:12] RECOVERY - IPMI Sensor Status on dbprov1002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:09:22] tgr: i didn't see that one -- does that need to go out today? [21:09:58] i missed it - sorry [21:10:13] i can do it now [21:10:23] yeah, that's the other half of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/827191 [21:10:40] probably should have put it on separate lines [21:10:52] thanks! 
[21:11:07] (03CR) 10Clare Ming: [C: 03+2] Fix WelcomeSurvey CentralAuthPostLoginRedirect hook (step 2) [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827549 (https://phabricator.wikimedia.org/T315583) (owner: 10Gergő Tisza) [21:13:26] koi: sorry about that [21:13:32] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:15:00] cjming: It's ok, I'll put them to another window if you thought it will take too much time [21:15:45] koi: if that's ok, that'd be great [21:18:04] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:19:33] tgr: i have to run in a few minutes - sounds like you have access and can finish deploying part 2 of your survey patch? [21:19:57] cjming: will do. Thanks for the backports so far! [21:20:11] tgr: np! 
i just +2'd it and was waiting for CI to finish [21:20:48] koi: if you are still around, I can do your changes afterwards [21:21:06] or beforewards, GrowthExperiments patches take super long to merge [21:21:31] tgr: sure, it will be great [21:22:59] (03CR) 10Gergő Tisza: [C: 03+2] bewikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826677 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [21:24:11] (03Merged) 10jenkins-bot: bewikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826677 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [21:26:27] (03PS2) 10Gergő Tisza: euwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826678 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [21:27:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:28:30] (03CR) 10Gergő Tisza: [C: 03+2] euwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826678 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [21:28:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:28:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:29:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:29:47] (03Merged) 10jenkins-bot: euwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826678 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [21:32:46] (03PS2) 10Gergő Tisza: cswikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826679 (https://phabricator.wikimedia.org/T310961) 
(owner: 10Stang) [21:34:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:35:40] (03CR) 10Gergő Tisza: [C: 03+2] cswikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826679 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [21:35:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:35:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:35:56] koi: do you want to test the logos on mwdebug? [21:36:09] tgr: yes [21:36:24] (03Merged) 10jenkins-bot: cswikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826679 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [21:36:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:37:25] koi: they are on mwdebug1002 [21:37:32] looking [21:38:24] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:38:39] tgr: all tested and LGTM [21:38:50] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:42:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:43:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:43:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:44:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:44:37] !log tgr@deploy1002 Synchronized logos/config.yaml: Config: Adjust width-height ratio of logos for [[gerrit:826677|bewikisource]], [[gerrit:826678|euwikisource]], [[gerrit:826679|cswikisource]] to fix display issue (T310961) (duration: 03m 45s) [21:44:41] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961 [21:48:53] !log tgr@deploy1002 Synchronized wmf-config/logos.php: Config: Adjust width-height ratio of logos for [[gerrit:826677|bewikisource]], [[gerrit:826678|euwikisource]], [[gerrit:826679|cswikisource]] to fix display issue (T310961) (duration: 03m 34s) [21:50:02] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:50:14] 10SRE, 10ops-codfw, 10DBA: db2149 is sad after reboot - https://phabricator.wikimedia.org/T316494 (10Papaul) @Ladsgroup you welcome [21:53:14] !log tgr@deploy1002 Synchronized static/images/project-logos: Config: Adjust width-height ratio of logos for [[gerrit:826677|bewikisource]], [[gerrit:826678|euwikisource]], [[gerrit:826679|cswikisource]] to fix display issue (T310961) (duration: 03m 59s) [21:53:19] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961 [21:55:04] PROBLEM - etcd request latencies on kubemaster1002 is 
CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [21:56:45] koi: should be all live [21:57:16] thanks! [22:02:16] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:08:19] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10greg) [22:13:10] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@57fb704]: re-deploy HEAD to attempt to get artifacts directory populated on an-airflow1001 [22:13:14] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@57fb704]: re-deploy HEAD to attempt to get artifacts directory populated on an-airflow1001 (duration: 00m 04s) [22:14:05] (03CR) 10CI reject: [V: 04-1] Fix WelcomeSurvey CentralAuthPostLoginRedirect hook (step 2) [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827549 (https://phabricator.wikimedia.org/T315583) (owner: 10Gergő Tisza) [22:21:36] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:33:19] (03CR) 10Gergő Tisza: [V: 03+2] Fix WelcomeSurvey CentralAuthPostLoginRedirect hook (step 2) [extensions/GrowthExperiments] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/827549 (https://phabricator.wikimedia.org/T315583) (owner: 10Gergő Tisza) [22:38:38] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:39:08] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@57fb704]: force re-deploy HEAD to attempt to get artifacts directory populated on an-airflow1001 [22:39:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:39:44] !log tgr@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/GrowthExperiments/extension.json: Backport: [[gerrit:827549|Fix WelcomeSurvey CentralAuthPostLoginRedirect hook (step 2)]] (duration: 03m 53s) [22:40:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:40:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:40:58] !log UTC late backport window done [22:41:00] (finally) [22:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:09] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@57fb704]: force re-deploy HEAD to attempt to get artifacts directory populated on an-airflow1001 (duration: 02m 01s) [22:41:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:43:30] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [22:56:11] (03PS1) 10Zabe: Stop all PHP 7.4 user traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827600 (https://phabricator.wikimedia.org/T316601) [22:59:42] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:03:44] PROBLEM - nova instance creation test on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:04:41] (03PS3) 10Krinkle: Explicitly set wgMessageCacheType=mcrouter (avoid newAnything in prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822119 (https://phabricator.wikimedia.org/T186673) [23:04:44] (03CR) 10Krinkle: [C: 03+2] Explicitly set wgMessageCacheType=mcrouter (avoid newAnything in prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822119 (https://phabricator.wikimedia.org/T186673) (owner: 10Krinkle) [23:05:04] * Krinkle staging on mwdebug1002 [23:05:13] (03CR) 10Krinkle: [C: 03+1] Stop all PHP 7.4 user traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827600 (https://phabricator.wikimedia.org/T316601) (owner: 10Zabe) [23:05:31] (03PS2) 10Krinkle: Enable wgKartographerStaticMapframe on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825840 (https://phabricator.wikimedia.org/T314750) [23:05:34] (03CR) 10Krinkle: [C: 03+1] Enable wgKartographerStaticMapframe on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825840 (https://phabricator.wikimedia.org/T314750) (owner: 10Krinkle) [23:05:48] (03Merged) 10jenkins-bot: Explicitly set wgMessageCacheType=mcrouter (avoid newAnything in prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822119 (https://phabricator.wikimedia.org/T186673) (owner: 10Krinkle) [23:08:43] (03CR) 10Gergő Tisza: [C: 03+1] "Ugh, apparently the JS version of sessionInSample interprets 0 as never, and the PHP version just dies." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/827600 (https://phabricator.wikimedia.org/T316601) (owner: 10Zabe) [23:11:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:12:33] (03CR) 10Krinkle: [C: 03+2] Stop all PHP 7.4 user traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827600 (https://phabricator.wikimedia.org/T316601) (owner: 10Zabe) [23:13:06] !log krinkle@deploy1002 Synchronized wmf-config/: Id9707db2273b31e12 (duration: 03m 48s) [23:14:20] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:14:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:14:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:14:46] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:15:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:17:40] (03CR) 10Krinkle: [C: 03+2] "I believe this would not drain traffic immediately as the cookie is valid for 7 days, and will only be removed by the JS to inform the 2nd" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825840 (https://phabricator.wikimedia.org/T314750) (owner: 10Krinkle) [23:17:48] (03CR) 10Legoktm: [C: 03+1] "Nice! 
+2, but I can't deploy it right now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827546 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [23:17:52] (03PS2) 10Krinkle: Stop all PHP 7.4 user traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827600 (https://phabricator.wikimedia.org/T316601) (owner: 10Zabe) [23:17:55] (03CR) 10Krinkle: [C: 03+2] Stop all PHP 7.4 user traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827600 (https://phabricator.wikimedia.org/T316601) (owner: 10Zabe) [23:19:11] (03Merged) 10jenkins-bot: Stop all PHP 7.4 user traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827600 (https://phabricator.wikimedia.org/T316601) (owner: 10Zabe) [23:21:23] (03PS2) 10Krinkle: logos: Raise error if logo is too tall [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827546 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [23:22:10] (03CR) 10Zabe: Enable wgKartographerStaticMapframe on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825840 (https://phabricator.wikimedia.org/T314750) (owner: 10Krinkle) [23:22:40] (03CR) 10Krinkle: [C: 03+2] "Confirmed that `tox -e logos -- generate` runs cleanly." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/827546 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [23:23:11] (03CR) 10Krinkle: [C: 03+2] "I believe this would not drain traffic immediately as the cookie is valid for 7 days, and will only be removed by the JS to inform the 2nd" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827600 (https://phabricator.wikimedia.org/T316601) (owner: 10Zabe) [23:23:16] (03PS3) 10Krinkle: Enable wgKartographerStaticMapframe on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825840 (https://phabricator.wikimedia.org/T314750) [23:23:25] (03Merged) 10jenkins-bot: logos: Raise error if logo is too tall [mediawiki-config] - 10https://gerrit.wikimedia.org/r/827546 (https://phabricator.wikimedia.org/T310961) (owner: 10Stang) [23:23:35] (03PS4) 10Krinkle: Enable wgKartographerStaticMapframe on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825840 (https://phabricator.wikimedia.org/T314750) [23:23:38] (03CR) 10Krinkle: Enable wgKartographerStaticMapframe on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825840 (https://phabricator.wikimedia.org/T314750) (owner: 10Krinkle) [23:24:25] (03Merged) 10jenkins-bot: Enable wgKartographerStaticMapframe on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825840 (https://phabricator.wikimedia.org/T314750) (owner: 10Krinkle) [23:24:42] !log krinkle@deploy1002 Synchronized wmf-config/: I5e0e5ad965f64810af7 (duration: 03m 27s) [23:25:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:26:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:26:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:27:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:28:50] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: 
instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:32:31] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I15a33444e27afa (duration: 03m 42s) [23:32:46] (03PS2) 10Krinkle: Undeploy ShortUrl extension from test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825833 (https://phabricator.wikimedia.org/T314750) [23:32:49] (03CR) 10Krinkle: [C: 03+2] Undeploy ShortUrl extension from test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825833 (https://phabricator.wikimedia.org/T314750) (owner: 10Krinkle) [23:32:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:32:59] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade fasw to Junos 21 - https://phabricator.wikimedia.org/T316542 (10Dwisehaupt) @ayounsi We have a maintenance week for frack scheduled for Sep 26-30. Would sometime that week be good for you? We could do fasw1-c-codfw before then i... 
[23:33:39] (03Merged) 10jenkins-bot: Undeploy ShortUrl extension from test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825833 (https://phabricator.wikimedia.org/T314750) (owner: 10Krinkle) [23:33:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:33:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:35:01] (03PS2) 10Krinkle: Disable wgCiteResponsiveReferences on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825834 [23:35:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:35:46] (03Abandoned) 10Krinkle: Disable wgCiteResponsiveReferences on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825834 (owner: 10Krinkle) [23:36:34] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:40:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:40:15] !log krinkle@deploy1002 Synchronized wmf-config/: I9f17d80d9d91 (duration: 03m 53s) [23:41:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:41:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:42:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:45:50] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:55:56] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28