[00:00:05] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210624T0000). [00:00:48] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [00:01:01] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1441.eqiad.wmnet with reason: REIMAGE [00:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:14] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [00:01:31] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1440.eqiad.wmnet with reason: REIMAGE [00:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:09] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1439.eqiad.wmnet with reason: REIMAGE [00:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:02] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1442.eqiad.wmnet with reason: REIMAGE [00:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1441.eqiad.wmnet with reason: REIMAGE [00:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:03] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1443.eqiad.wmnet with reason: REIMAGE [00:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:51] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1442.eqiad.wmnet with reason: REIMAGE [00:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:59] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
mw1443.eqiad.wmnet with reason: REIMAGE [00:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:02] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1445.eqiad.wmnet with reason: REIMAGE [00:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:08] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1446.eqiad.wmnet with reason: REIMAGE [00:11:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1445.eqiad.wmnet with reason: REIMAGE [00:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:03] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1447.eqiad.wmnet with reason: REIMAGE [00:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:26] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1446.eqiad.wmnet with reason: REIMAGE [00:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:41] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1447.eqiad.wmnet with reason: REIMAGE [00:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:42] 10SRE, 10serviceops, 10Parsoid (Tracking): Maybe consider consolidating parsoid-* and restbase-* proxy services, respectively - https://phabricator.wikimedia.org/T285445 (10Arlolra) Also consider T279825#7174049 [00:22:05] (03Abandoned) 10Arlolra: Use restbase-for-services for VE's VirtualRestClient calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699434 (https://phabricator.wikimedia.org/T279825) (owner: 10Arlolra) [00:55:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - 
https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1444.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1444.eqiad.wmnet'] ` [00:59:16] (03PS1) 10Krinkle: media: Make the file metadata "_error" check looser [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701191 (https://phabricator.wikimedia.org/T285431) [01:11:07] !log deployment-memc08 and -memc09: apt-get install memkeys (already installed on deployment-mediawiki11) [01:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:14] argh, wrong log [01:25:07] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:27:02] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:34:12] (03PS2) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) [01:38:45] (03CR) 10Legoktm: "This got a bit more involved than I was expecting, but I tested the various commands in isolation on mwmaint2002 and it all seemed to work" [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [01:41:41] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [01:44:56] 10SRE, 10conftool, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (10Legoktm) I updated https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/701053/ with a process that should address all the conc... 
[02:05:36] (03PS3) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) [02:06:42] (03PS1) 10Legoktm: sre.switchdc.mediawiki: Update for periodic job changes in spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) [02:06:56] (03CR) 10jerkins-bot: [V: 04-1] sre.switchdc.mediawiki: Update for periodic job changes in spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [02:07:20] (03PS2) 10Legoktm: sre.switchdc.mediawiki: Update for periodic job changes in spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) [02:07:46] (03PS2) 10Legoktm: sre.switchdc.mediawiki: Warm up caches in api_appserver cluster too [cookbooks] - 10https://gerrit.wikimedia.org/r/700704 (https://phabricator.wikimedia.org/T269179) [02:11:01] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [02:55:00] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:14:09] (03CR) 10Krinkle: "I don't have a strong preference one way or the other, but my takeaway was that we'd loop over both only for urls-server." [cookbooks] - 10https://gerrit.wikimedia.org/r/700704 (https://phabricator.wikimedia.org/T269179) (owner: 10Legoktm) [03:50:54] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [04:29:46] (03PS1) 10Marostegui: Revert "db2079: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/701204 [04:34:53] (03CR) 10Marostegui: [C: 03+2] Revert "db2079: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/701204 (owner: 10Marostegui) [04:52:00] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [05:16:29] (03PS1) 10Samwilson: Enable OCR tool on all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701225 (https://phabricator.wikimedia.org/T285311) [05:17:58] (03Abandoned) 10Samwilson: Remove defunct feature flag $wgWikisourceEnableOcr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701016 (https://phabricator.wikimedia.org/T285311) (owner: 10Samwilson) [05:28:18] (03CR) 10DannyS712: [C: 03+1] "Code-wise this is good, assuming the product decision is to enable everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701225 (https://phabricator.wikimedia.org/T285311) (owner: 10Samwilson) [05:40:52] (03PS1) 10Ryan Kemper: elasticsearch: further refactor rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/701247 [05:42:05] (03PS2) 10Ryan Kemper: elasticsearch: further refactor rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/701247 (https://phabricator.wikimedia.org/T280221) [05:46:49] (03CR) 10Ryan Kemper: "See commit message for changelog." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/701247 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [06:41:31] (03PS1) 10Elukey: Rename superset's hiera config [labs/private] - 10https://gerrit.wikimedia.org/r/701327 (https://phabricator.wikimedia.org/T268219) [06:41:52] (03CR) 10Elukey: [V: 03+2 C: 03+2] Rename superset's hiera config [labs/private] - 10https://gerrit.wikimedia.org/r/701327 (https://phabricator.wikimedia.org/T268219) (owner: 10Elukey) [06:43:22] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:14:20] (03CR) 10Legoktm: swift: Only run swiftrepl-mw in the active datacenter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/701052 (https://phabricator.wikimedia.org/T285373) (owner: 10Legoktm) [07:14:28] (03PS2) 10Legoktm: swift: Only run swiftrepl-mw in the active datacenter [puppet] - 10https://gerrit.wikimedia.org/r/701052 (https://phabricator.wikimedia.org/T285373) [07:15:32] (03CR) 10Muehlenhoff: [C: 03+2] return-tgt-for-user: Fix date parsing [puppet] - 10https://gerrit.wikimedia.org/r/701116 (owner: 10Muehlenhoff) [07:15:38] (03PS3) 10Ema: varnish: add error counter to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/701070 (https://phabricator.wikimedia.org/T284576) [07:20:13] (03CR) 10Ema: [C: 03+2] varnish: add error counter to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/701070 (https://phabricator.wikimedia.org/T284576) (owner: 10Ema) [07:23:40] (03PS3) 10Ema: varnish: install mtail programs in a loop [puppet] - 10https://gerrit.wikimedia.org/r/701083 [07:25:32] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/701083 (owner: 10Ema) [07:26:45] (03CR) 10Ema: [C: 03+2] varnish: install mtail programs in a loop [puppet] - 10https://gerrit.wikimedia.org/r/701083 (owner: 10Ema) [07:26:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s8 
weights T284897', diff saved to https://phabricator.wikimedia.org/P16710 and previous config saved to /var/cache/conftool/dbconfig/20210624-072657-marostegui.json [07:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:03] T284897: Pre DC switchover eqiad -> codfw DB work - https://phabricator.wikimedia.org/T284897 [07:31:28] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10valerio.bozzolan) >>! In T261694#7172911, @AntiCompositeNumber wrote: > Are WMF map tiles in use elsewhere on wikimedia.it, or... [07:31:36] (03PS1) 10Legoktm: mailman3: Fix logrotate, set retention to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/701333 (https://phabricator.wikimedia.org/T285376) [07:32:53] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29982/console" [puppet] - 10https://gerrit.wikimedia.org/r/701333 (https://phabricator.wikimedia.org/T285376) (owner: 10Legoktm) [07:36:03] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Fix Mailman3 log rotate - https://phabricator.wikimedia.org/T285376 (10Legoktm) I'll file a bug upstream in Debian too. 
[07:41:32] (03CR) 10Legoktm: [C: 03+2] mailman: Drop absented files and packages [puppet] - 10https://gerrit.wikimedia.org/r/697635 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [07:42:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s7 weights T284897', diff saved to https://phabricator.wikimedia.org/P16711 and previous config saved to /var/cache/conftool/dbconfig/20210624-074200-marostegui.json [07:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:09] T284897: Pre DC switchover eqiad -> codfw DB work - https://phabricator.wikimedia.org/T284897 [07:44:06] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:45:02] (03CR) 10Legoktm: "We still have mailman-mailman02.mailman.eqiad1.wikimedia.cloud running role(lists3), should we just shut that down now?" [puppet] - 10https://gerrit.wikimedia.org/r/698306 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [07:49:05] (03PS1) 10Kormat: mariadb: Set most sections to bidi replication. 
[puppet] - 10https://gerrit.wikimedia.org/r/701335 (https://phabricator.wikimedia.org/T284897) [07:51:38] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (NOOP 9 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29983/console" [puppet] - 10https://gerrit.wikimedia.org/r/701335 (https://phabricator.wikimedia.org/T284897) (owner: 10Kormat) [07:53:26] (03CR) 10Legoktm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [07:56:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s6 weights T284897', diff saved to https://phabricator.wikimedia.org/P16712 and previous config saved to /var/cache/conftool/dbconfig/20210624-075613-marostegui.json [07:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:19] T284897: Pre DC switchover eqiad -> codfw DB work - https://phabricator.wikimedia.org/T284897 [07:57:34] (03CR) 10Marostegui: [C: 03+1] mariadb: Set most sections to bidi replication. [puppet] - 10https://gerrit.wikimedia.org/r/701335 (https://phabricator.wikimedia.org/T284897) (owner: 10Kormat) [08:02:21] (03CR) 10Jcrespo: "> I think the underlying question is: can we safely rely on the 5yr/indefinite backup that we made or should we continue doing regular bac" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [08:06:10] (03CR) 10Kormat: [V: 03+1 C: 03+2] mariadb: Set most sections to bidi replication. 
[puppet] - 10https://gerrit.wikimedia.org/r/701335 (https://phabricator.wikimedia.org/T284897) (owner: 10Kormat) [08:08:19] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 0:45:00 on 216 hosts with reason: Change replication monitoring config T284897 [08:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:24] T284897: Pre DC switchover eqiad -> codfw DB work - https://phabricator.wikimedia.org/T284897 [08:09:37] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on 216 hosts with reason: Change replication monitoring config T284897 [08:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1130 from s5 api T284897', diff saved to https://phabricator.wikimedia.org/P16713 and previous config saved to /var/cache/conftool/dbconfig/20210624-080945-marostegui.json [08:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s5 weights T284897', diff saved to https://phabricator.wikimedia.org/P16714 and previous config saved to /var/cache/conftool/dbconfig/20210624-081137-marostegui.json [08:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s5 weights T284897', diff saved to https://phabricator.wikimedia.org/P16715 and previous config saved to /var/cache/conftool/dbconfig/20210624-081251-marostegui.json [08:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s5 weights T284897', diff saved to https://phabricator.wikimedia.org/P16716 and previous config saved to /var/cache/conftool/dbconfig/20210624-081409-marostegui.json [08:14:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:15] T284897: Pre DC switchover eqiad -> codfw DB work - https://phabricator.wikimedia.org/T284897 [08:30:34] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:42] (03PS1) 10Muehlenhoff: return-tgt-for-user: Fix date parsing even more [puppet] - 10https://gerrit.wikimedia.org/r/701342 [08:35:45] (03CR) 10Muehlenhoff: [C: 03+2] return-tgt-for-user: Fix date parsing even more [puppet] - 10https://gerrit.wikimedia.org/r/701342 (owner: 10Muehlenhoff) [08:36:34] 10SRE, 10vm-requests: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (10Dzahn) [08:37:08] 10SRE, 10serviceops, 10vm-requests: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (10Dzahn) [08:37:21] 10SRE, 10GitLab, 10serviceops, 10vm-requests: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (10Dzahn) [08:40:21] (03CR) 10Ladsgroup: [C: 03+2] "deploying" [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701191 (https://phabricator.wikimedia.org/T285431) (owner: 10Krinkle) [08:41:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s3 weights T284897', diff saved to https://phabricator.wikimedia.org/P16717 and previous config saved to /var/cache/conftool/dbconfig/20210624-084147-marostegui.json [08:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:53] T284897: Pre DC switchover eqiad -> codfw DB work - https://phabricator.wikimedia.org/T284897 [08:42:44] 10SRE, 10GitLab, 10serviceops, 10vm-requests: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (10MoritzMuehlenhoff) Looks fine, please create in row C or D to better balance our capacity. In general let's avoid notes like "same as gitlab1001", if one looks back at the... 
[08:44:10] (03CR) 10Ladsgroup: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/698306 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [08:48:44] 10SRE, 10Infrastructure-Foundations, 10Mail: Please create "grant@wikipedia.org" email handle to use for annual fundraising email test - https://phabricator.wikimedia.org/T285432 (10Aklapper) [08:48:49] (03CR) 10Ladsgroup: [C: 03+1] "Conceptually looks good to me but I haven't done much in logrotate to confidently say it's correct in all aspects" [puppet] - 10https://gerrit.wikimedia.org/r/701333 (https://phabricator.wikimedia.org/T285376) (owner: 10Legoktm) [08:55:32] !log root@lists1001:/var/log/mailman# rm -rf * [08:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:13] (03CR) 10Legoktm: [V: 03+1 C: 03+2] mailman3: Fix logrotate, set retention to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/701333 (https://phabricator.wikimedia.org/T285376) (owner: 10Legoktm) [09:00:27] (03Merged) 10jenkins-bot: media: Make the file metadata "_error" check looser [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701191 (https://phabricator.wikimedia.org/T285431) (owner: 10Krinkle) [09:01:45] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Fix Mailman3 log rotate - https://phabricator.wikimedia.org/T285376 (10Legoktm) a:03Legoktm Will close tomorrow after verifying logrotate worked. 
[09:02:33] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.11/includes: Backport: [[gerrit:701191|media: Make the file metadata "_error" check looser (T285431)]] (duration: 01m 12s) [09:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:38] T285431: PHP Notice: Undefined index: frameCount - https://phabricator.wikimedia.org/T285431 [09:11:57] 10SRE, 10Privacy Engineering, 10Security-Team, 10Wikimedia-Mailing-lists, and 2 others: /var/log/mailman/subscribe* has PII (IP addresses) from August 2020 - https://phabricator.wikimedia.org/T281619 (10Legoktm) [09:17:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s2 weights T284897', diff saved to https://phabricator.wikimedia.org/P16718 and previous config saved to /var/cache/conftool/dbconfig/20210624-091753-marostegui.json [09:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:59] T284897: Pre DC switchover eqiad -> codfw DB work - https://phabricator.wikimedia.org/T284897 [09:19:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s1 weights T284897', diff saved to https://phabricator.wikimedia.org/P16719 and previous config saved to /var/cache/conftool/dbconfig/20210624-091949-marostegui.json [09:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s1 weights T284897', diff saved to https://phabricator.wikimedia.org/P16720 and previous config saved to /var/cache/conftool/dbconfig/20210624-092029-marostegui.json [09:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s1 weights T284897', diff saved to https://phabricator.wikimedia.org/P16721 and previous config saved to /var/cache/conftool/dbconfig/20210624-092105-marostegui.json [09:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:58] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s1 weights T284897', diff saved to https://phabricator.wikimedia.org/P16722 and previous config saved to /var/cache/conftool/dbconfig/20210624-092157-marostegui.json [09:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s1 weights T284897', diff saved to https://phabricator.wikimedia.org/P16723 and previous config saved to /var/cache/conftool/dbconfig/20210624-092226-marostegui.json [09:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:11] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Marostegui) [09:29:48] (03PS1) 10Ayounsi: Port labs-in4/6 to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/701347 [09:37:36] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [09:38:56] RECOVERY - Host wdqs1013 is UP: PING WARNING - Packet loss = 80%, RTA = 0.28 ms [09:44:29] (03PS1) 10Muehlenhoff: Add logout.d script for the IDP (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) [09:48:48] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable link recommendation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701351 (https://phabricator.wikimedia.org/T284481) [09:49:58] 10SRE, 10Infrastructure-Foundations, 10netops: Review filtering for cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 (10cmooney) [09:50:21] (03CR) 10jerkins-bot: [V: 04-1] GrowthExperiments: Enable link recommendation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701351 (https://phabricator.wikimedia.org/T284481) (owner: 10Kosta Harlan) [09:52:00] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable link recommendation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701351 
(https://phabricator.wikimedia.org/T284481) [09:53:41] 10SRE, 10Infrastructure-Foundations, 10netops: Review filtering for cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 (10cmooney) Some work has already been completed to update the current rules: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/701347 [09:58:41] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701351 (https://phabricator.wikimedia.org/T284481) (owner: 10Kosta Harlan) [10:00:05] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210624T1000). [10:03:51] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Enable link recommendation feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701351 (https://phabricator.wikimedia.org/T284481) (owner: 10Kosta Harlan) [10:11:01] (03PS3) 10Kosta Harlan: GrowthExperiments: Enable link recommendation feature for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701351 (https://phabricator.wikimedia.org/T284481) [10:12:41] (03PS2) 10Muehlenhoff: Add logout.d script for the IDP [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) [10:13:37] (03CR) 10Muehlenhoff: [C: 03+2] Remove backup for sretest [puppet] - 10https://gerrit.wikimedia.org/r/701084 (owner: 10Muehlenhoff) [10:20:23] (03PS1) 10Muehlenhoff: Drop account end date for iflorez [puppet] - 10https://gerrit.wikimedia.org/r/701357 [10:21:43] (03PS1) 10Ema: varnish: add varnish_sli_good, varnish_sli_all counters [puppet] - 10https://gerrit.wikimedia.org/r/701358 (https://phabricator.wikimedia.org/T284576) [10:41:27] (03PS1) 10Gergő Tisza: Re-apply "Add custom signup flow for donors", step 1 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701361 (https://phabricator.wikimedia.org/T284799) [10:42:23] 
(03PS2) 10Gergő Tisza: Re-apply "Add custom signup flow for donors", step 1 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701361 (https://phabricator.wikimedia.org/T284799) [10:43:11] (03CR) 10Gergő Tisza: "All changes beyond simple omissions via `checkout -p @~` are split into PS2 for ease of review." [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701361 (https://phabricator.wikimedia.org/T284799) (owner: 10Gergő Tisza) [10:51:31] 10SRE, 10Infrastructure-Foundations, 10Mail: Please create "grant@wikipedia.org" email handle to use for annual fundraising email test - https://phabricator.wikimedia.org/T285432 (10faidon) Hi @MNoorWMF - is Grant aware this fundraising test in his name is happening? If so, would it be possible to Cc me on the... [10:51:41] (03PS1) 10Gergő Tisza: Re-apply "Add custom signup flow for donors", step 2 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701363 (https://phabricator.wikimedia.org/T284799) [10:51:49] (03PS1) 10Gergő Tisza: Re-apply "Add custom signup flow for donors", step 3 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701364 (https://phabricator.wikimedia.org/T284799) [10:52:37] (03CR) 10Urbanecm: [C: 04-1] Re-apply "Add custom signup flow for donors", step 1 (032 comments) [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701361 (https://phabricator.wikimedia.org/T284799) (owner: 10Gergő Tisza) [10:54:21] apergos: you asked yesterday how to split otherwise undeployable patches. If you are building some notes for trainees, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/688200 is a past example that can probably be included in those notes :). 
[10:54:35] (and of course the patch tg.r is preparing now) [10:54:52] (03CR) 10Gergő Tisza: Re-apply "Add custom signup flow for donors", step 1 (032 comments) [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701361 (https://phabricator.wikimedia.org/T284799) (owner: 10Gergő Tisza) [10:55:44] I am not writing the notes yet but yes I am absolutely accumulating examples [10:55:45] (03PS3) 10Gergő Tisza: Re-apply "Add custom signup flow for donors", step 1 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701361 (https://phabricator.wikimedia.org/T284799) [10:55:47] (03PS2) 10Gergő Tisza: Re-apply "Add custom signup flow for donors", step 2 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701363 (https://phabricator.wikimedia.org/T284799) [10:55:49] (03PS2) 10Gergő Tisza: Re-apply "Add custom signup flow for donors", step 3 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701364 (https://phabricator.wikimedia.org/T284799) [10:55:56] so definitely please toss those my way! [10:56:02] will do! [10:56:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/701357 (owner: 10Muehlenhoff) [10:58:50] (03CR) 10jerkins-bot: [V: 04-1] Re-apply "Add custom signup flow for donors", step 2 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701363 (https://phabricator.wikimedia.org/T284799) (owner: 10Gergő Tisza) [11:00:05] Amir1, Lucas_WMDE, apergos, and duesen: (Dis)respected human, time to deploy EU Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210624T1100). Please do the needful. [11:00:05] kostajh and tgr: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:12] I'm here barely :-D [11:00:29] but I have not had time to look at the 4 patches yet. No one was signed up for a training when I checked earlier [11:00:29] i'm here too [11:00:31] Hello there! I was wondering, can this BC window be used "normally", or is it just for training? I'd have a last-minute patch to deploy if possible [11:00:32] I can do the deploys [11:00:38] looking again just to make sure [11:00:48] hi [11:00:50] Daimona: it's used normally unless someone shows up for training [11:00:50] Daimona: people are invited to schedule patches here! [11:00:59] this is to make trainees able to look at real deployment :) [11:01:25] still no trainees listed [11:01:38] Noice :) I'll add it to the calendar then [11:01:42] and we're less than 6 patches so you can absolutely still get your ones in [11:03:10] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Enable link recommendation feature for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701351 (https://phabricator.wikimedia.org/T284481) (owner: 10Kosta Harlan) [11:03:35] {{done}}, heading to lunch now until it's my turn [11:03:36] (03CR) 10Gergő Tisza: [C: 03+2] Re-apply "Add custom signup flow for donors", step 1 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701361 (https://phabricator.wikimedia.org/T284799) (owner: 10Gergő Tisza) [11:03:58] (03Merged) 10jenkins-bot: GrowthExperiments: Enable link recommendation feature for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701351 (https://phabricator.wikimedia.org/T284481) (owner: 10Kosta Harlan) [11:05:12] I'm under the impression jenkins will not like `$this->userOptionsManager->setOption` when `$this->userOptionsManager` can be null, but let's see -- maybe i'm wrong [11:05:50] kostajh: it's on mwdebug1001 [11:06:24] tgr: thanks [11:06:59] urbanecm: we'll see but AFAIK phan is the only check that cares about that, and phan can derive extra conditions for the type from being inside 
an if() [11:07:54] sure, let's see. After all, the only thing that matters is if it works :-). [11:08:35] tgr: I verified that the task doesn't appear if you're not in the experiment group, and that it does when you are. anything else to check here? [11:09:14] I don't think so. [11:10:03] (03CR) 10Muehlenhoff: [C: 03+2] Drop account end date for iflorez [puppet] - 10https://gerrit.wikimedia.org/r/701357 (owner: 10Muehlenhoff) [11:10:05] cool, let's do it then [11:10:59] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:701351|GrowthExperiments: Enable link recommendation feature for more wikis (T284481)]] (duration: 01m 07s) [11:11:04] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ChristineDeKock) Thanks, that resolved the kinit issue. One more thing: I am using the newpyter system as per the instructions [[ https://wikitech.wikimedia.org/wiki/Analytics/... [11:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:06] T284481: Deploy Add a link to the second set of wikis - https://phabricator.wikimedia.org/T284481 [11:11:48] (03PS2) 10Gergő Tisza: Enable OCR tool on all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701225 (https://phabricator.wikimedia.org/T285311) (owner: 10Samwilson) [11:12:13] (03CR) 10Gergő Tisza: [C: 03+2] Enable OCR tool on all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701225 (https://phabricator.wikimedia.org/T285311) (owner: 10Samwilson) [11:13:44] (03Merged) 10jenkins-bot: Enable OCR tool on all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701225 (https://phabricator.wikimedia.org/T285311) (owner: 10Samwilson) [11:13:50] @tgr can I add a patch to this deployment window? 
[11:13:53] (03PS3) 10DannyS712: Update $wgNamespacesToBeSearchedDefault for wikimania [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699290 (https://phabricator.wikimedia.org/T284793) [11:14:02] ^ thats the patch I'd like to add [11:14:22] DannyS712: sure [11:14:52] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Urbanecm) >>! In T284987#7174830, @ChristineDeKock wrote: > Thanks, that resolved the kinit issue. > > One more thing: I am using the newpyter system as per the instructions [[... [11:15:02] Daimona: it's on mwdebug1001 [11:15:20] Thank you, testing [11:15:29] DannyS712: can you add it to the deploy calendar? [11:15:42] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:15:59] tgr done [11:16:07] thats what I was working on [11:16:31] (03PS1) 10Jgiannelos: Add caching support for tegola [deployment-charts] - 10https://gerrit.wikimedia.org/r/701369 [11:19:08] (03CR) 10Jbond: [C: 04-1] "There is one -1 in here, the others are nits and can be addresses later when the script is converted to the wmdlib.logoutd class" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [11:19:44] (03PS2) 10Jgiannelos: Add caching support for tegola [deployment-charts] - 10https://gerrit.wikimedia.org/r/701369 [11:20:16] tgr: tested on a couple of wikisources, everything seems fine. [11:20:17] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ChristineDeKock) This does not work unfortunately. Is there anything else I can do? 
[11:21:01] (03CR) 10Gergő Tisza: [C: 03+2] Update $wgNamespacesToBeSearchedDefault for wikimania [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699290 (https://phabricator.wikimedia.org/T284793) (owner: 10DannyS712) [11:21:46] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:701225|Enable OCR tool on all Wikisources (T285311)]] (duration: 01m 06s) [11:21:47] (03Merged) 10jenkins-bot: Update $wgNamespacesToBeSearchedDefault for wikimania [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699290 (https://phabricator.wikimedia.org/T284793) (owner: 10DannyS712) [11:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:51] T285311: Enable OCR improvements on all remaining Wikisources - https://phabricator.wikimedia.org/T285311 [11:22:40] DannyS712: it's on mwdebug1001 [11:23:10] Thank you! [11:23:43] tgr confirmed to be working, https://wikimania.wikimedia.org/wiki/Special:Search shows 2021 and Main as the defaults when using mwdebug1001 after a shift-refresh [11:25:19] (03CR) 10Jbond: [C: 03+1] "LGTM but see nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701171 (https://phabricator.wikimedia.org/T285425) (owner: 10Legoktm) [11:25:37] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:699290|Update $wgNamespacesToBeSearchedDefault for wikimania (T284793)]] (duration: 01m 07s) [11:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:42] T284793: Add 2021 namespace to the default search of Wikimania wiki - https://phabricator.wikimedia.org/T284793 [11:27:13] (03Merged) 10jenkins-bot: Re-apply "Add custom signup flow for donors", step 1 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701361 (https://phabricator.wikimedia.org/T284799) (owner: 10Gergő Tisza) [11:28:12] (03CR) 10Gergő Tisza: [C: 03+2] Re-apply "Add custom signup flow for donors", step 2 
[extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701363 (https://phabricator.wikimedia.org/T284799) (owner: 10Gergő Tisza) [11:30:00] tgr thanks, it works - do I need to stick around for anything else? [11:34:41] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.9/extensions/GrowthExperiments: Backport: [[gerrit:701361|Re-apply "Add custom signup flow for donors", step 1 (T284799 T284740 T284800 T285281)]] (duration: 01m 06s) [11:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:50] T284740: Donors to newcomers: design an enhanced account creation landing page - https://phabricator.wikimedia.org/T284740 [11:34:50] T285281: Donors to newcomers: go straight to homepage - https://phabricator.wikimedia.org/T285281 [11:34:51] T284800: Donors to newcomers: URL parameters - https://phabricator.wikimedia.org/T284800 [11:34:51] T284799: [EPIC] Encourage donors to create accounts - https://phabricator.wikimedia.org/T284799 [11:35:35] (03CR) 10Gergő Tisza: [C: 03+2] Re-apply "Add custom signup flow for donors", step 3 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701364 (https://phabricator.wikimedia.org/T284799) (owner: 10Gergő Tisza) [11:36:38] (03Abandoned) 10Jgiannelos: Add blubber variant for tile pregeneration image [software/tegola] (v0.14.x) - 10https://gerrit.wikimedia.org/r/699251 (owner: 10Jgiannelos) [11:37:24] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=registry2008.codfw.wmnet,dc=codfw,cluster=docker-registry [11:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:57] !log depooling registry2008 for some dragonfly testing [11:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:15] DannyS712: sorry, didn't see the question. No, that's all. 
[11:44:16] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on registry2008.codfw.wmnet with reason: Dragonfly tests (jayme) [11:44:17] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on registry2008.codfw.wmnet with reason: Dragonfly tests (jayme) [11:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:27] (03PS1) 10Jgiannelos: Add blubber variant for tile pregeneration image [software/tegola] - 10https://gerrit.wikimedia.org/r/701372 [11:45:09] (03CR) 10Jbond: [C: 03+2] spec_helper: switch to rspec mocha and add rspec_parrallel arguments [puppet] - 10https://gerrit.wikimedia.org/r/701077 (owner: 10Jbond) [11:45:14] (03CR) 10Jbond: [C: 03+2] rake_modules: update the dynamic spec test to use ParallelTests [puppet] - 10https://gerrit.wikimedia.org/r/700660 (owner: 10Jbond) [11:45:47] (03PS1) 10Jgiannelos: Add blubber variant for tile pregeneration image [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/701373 [11:48:14] (03CR) 10Jgiannelos: "I abandoned the previous patch because I got confused with how git review works with target branches. Now it should be targeting the right" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/701373 (owner: 10Jgiannelos) [11:50:38] (03Merged) 10jenkins-bot: Re-apply "Add custom signup flow for donors", step 2 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701363 (https://phabricator.wikimedia.org/T284799) (owner: 10Gergő Tisza) [11:51:52] (03CR) 10Jgiannelos: "I have already tested it locally in minikube using k8s cronjobs (WIP). 
It pregenerated tiles just fine by manually enqueuing z/x/y tasks f" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/701373 (owner: 10Jgiannelos) [11:52:19] (03CR) 10Volans: "The change looks reasonable but I'm missing the details of the current implementation on the mediawiki side. Couple of nits inline." (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [11:52:27] (03Merged) 10jenkins-bot: Re-apply "Add custom signup flow for donors", step 3 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.9) - 10https://gerrit.wikimedia.org/r/701364 (https://phabricator.wikimedia.org/T284799) (owner: 10Gergő Tisza) [11:53:00] (03PS10) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [11:53:03] (03CR) 10Volans: [C: 03+1] "This looks consistent with the related patch in spicerack. Ofc depends on that to be merged and released first." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [11:53:09] !log import dragonfly_1.0.6-1 into buster-wikimedia [11:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:17] (03PS15) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [11:53:27] (03PS16) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [11:53:46] (03CR) 10Volans: "> Patch Set 2: Code-Review-1" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [11:55:32] (03CR) 10Volans: "one nit inline, looks sane to me" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/701247 (https://phabricator.wikimedia.org/T280221) (owner: 10Ryan Kemper) [11:56:04] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [11:56:21] (03PS2) 10Ema: varnish: add counters for Varnish SLI [puppet] - 10https://gerrit.wikimedia.org/r/701358 (https://phabricator.wikimedia.org/T284576) [11:59:10] (03PS3) 10Ema: varnish: add counters for Varnish SLI [puppet] - 10https://gerrit.wikimedia.org/r/701358 (https://phabricator.wikimedia.org/T284576) [12:01:43] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/701358 (https://phabricator.wikimedia.org/T284576) (owner: 10Ema) [12:02:57] (03PS1) 10Volans: idm: add usage examples in the docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701375 [12:08:28] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.9/extensions/GrowthExperiments: Backport: [[gerrit:701363|Re-apply "Add custom signup flow for donors", step 2 (T284799 T284740 T284800 
T285281)]] (duration: 01m 06s) [12:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:36] T284740: Donors to newcomers: design an enhanced account creation landing page - https://phabricator.wikimedia.org/T284740 [12:08:37] T285281: Donors to newcomers: go straight to homepage - https://phabricator.wikimedia.org/T285281 [12:08:37] T284800: Donors to newcomers: URL parameters - https://phabricator.wikimedia.org/T284800 [12:08:37] T284799: [EPIC] Encourage donors to create accounts - https://phabricator.wikimedia.org/T284799 [12:09:58] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701375 (owner: 10Volans) [12:11:01] (03CR) 10Volans: [C: 03+2] idm: add usage examples in the docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701375 (owner: 10Volans) [12:11:40] (03PS2) 10Hnowlan: postgres::slave: remove problematic auto-replicate [puppet] - 10https://gerrit.wikimedia.org/r/700071 [12:12:30] 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10faidon) Prioritization-wise, is there a reason why we're going for an IPv6 allocation while our IPv4 segmentation is still in flux or in progress? I fear that we're adding more features/problems t... [12:12:35] (03CR) 10jerkins-bot: [V: 04-1] postgres::slave: remove problematic auto-replicate [puppet] - 10https://gerrit.wikimedia.org/r/700071 (owner: 10Hnowlan) [12:13:29] (03Merged) 10jenkins-bot: idm: add usage examples in the docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701375 (owner: 10Volans) [12:16:26] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:18:07] (03CR) 10Marostegui: "Should we revert this once we've disabled eqiad -> codfw replication after the switchover?" 
[puppet] - 10https://gerrit.wikimedia.org/r/701335 (https://phabricator.wikimedia.org/T284897) (owner: 10Kormat) [12:18:43] !log tgr@deploy1002 Started scap: Backport: [[gerrit:701364|Re-apply "Add custom signup flow for donors", step 3 (T284799 T284740 T284800 T285281)]] [12:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:51] T284740: Donors to newcomers: design an enhanced account creation landing page - https://phabricator.wikimedia.org/T284740 [12:18:51] T285281: Donors to newcomers: go straight to homepage - https://phabricator.wikimedia.org/T285281 [12:18:51] T284800: Donors to newcomers: URL parameters - https://phabricator.wikimedia.org/T284800 [12:18:52] T284799: [EPIC] Encourage donors to create accounts - https://phabricator.wikimedia.org/T284799 [12:21:25] (03CR) 10Muehlenhoff: Add logout.d script for the IDP (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [12:21:28] (03PS3) 10Muehlenhoff: Add logout.d script for the IDP [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) [12:22:16] (03CR) 10Muehlenhoff: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [12:26:26] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.8 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701377 [12:30:00] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.8 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701377 (owner: 10Volans) [12:33:39] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.8 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701377 (owner: 10Volans) [12:37:41] (03PS1) 10Volans: Upstream release v0.0.8 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/701379 [12:41:15] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.8 
[software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/701379 (owner: 10Volans) [12:43:22] (03PS17) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [12:44:05] (03Merged) 10jenkins-bot: Upstream release v0.0.8 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/701379 (owner: 10Volans) [12:44:50] !log tgr@deploy1002 Finished scap: Backport: [[gerrit:701364|Re-apply "Add custom signup flow for donors", step 3 (T284799 T284740 T284800 T285281)]] (duration: 26m 07s) [12:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:59] T284740: Donors to newcomers: design an enhanced account creation landing page - https://phabricator.wikimedia.org/T284740 [12:44:59] T285281: Donors to newcomers: go straight to homepage - https://phabricator.wikimedia.org/T285281 [12:44:59] T284800: Donors to newcomers: URL parameters - https://phabricator.wikimedia.org/T284800 [12:45:00] T284799: [EPIC] Encourage donors to create accounts - https://phabricator.wikimedia.org/T284799 [12:45:50] !log EU deploys done [12:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:39] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Urbanecm) >>! In T284987#7174844, @ChristineDeKock wrote: > This does not work unfortunately. Is there anything else I can do? Can you share the username you use here? We can th... [13:10:02] (03PS3) 10Jbond: postgres::slave: remove problematic auto-replicate [puppet] - 10https://gerrit.wikimedia.org/r/700071 (owner: 10Hnowlan) [13:12:29] (03CR) 10Jbond: [C: 03+1] "I sent a fix (well another hack) to the rspec test failing. 
The issue here is related to us hacking around rspec so that we can test our " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/700071 (owner: 10Hnowlan) [13:16:38] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppet-rspec has trouble testing custom facts - https://phabricator.wikimedia.org/T285476 (10jbond) [13:17:06] (03PS4) 10Jbond: postgres::slave: remove problematic auto-replicate [puppet] - 10https://gerrit.wikimedia.org/r/700071 (https://phabricator.wikimedia.org/T285476) (owner: 10Hnowlan) [13:17:15] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ChristineDeKock) >>! In T284987#7175071, @Urbanecm wrote: >>>! In T284987#7174844, @ChristineDeKock wrote: >> This does not work unfortunately. Is there anything else I can do? >... [13:17:49] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/700071 (https://phabricator.wikimedia.org/T285476) (owner: 10Hnowlan) [13:18:16] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppet-rspec has trouble testing custom facts - https://phabricator.wikimedia.org/T285476 (10jbond) p:05Triage→03Medium [13:20:43] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Review filtering for cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 (10jbond) p:05Triage→03Medium [13:21:12] 10SRE, 10serviceops, 10Parsoid (Tracking): Maybe consider consolidating parsoid-* and restbase-* proxy services, respectively - https://phabricator.wikimedia.org/T285445 (10jbond) p:05Triage→03Medium [13:22:24] 10SRE, 10Infrastructure-Foundations, 10Mail: Please create "grant@wikipedia.org" email handle to use for annual fundraising email test - https://phabricator.wikimedia.org/T285432 (10jbond) p:05Triage→03Medium [13:22:51] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet does not undo manual 
"systemctl mask $unit" - https://phabricator.wikimedia.org/T285425 (10jbond) p:05Triage→03Medium [13:23:19] 10SRE, 10Wikimedia-Mailing-lists: Unicode in display name gets mangled in Mailman unsubscription notification - https://phabricator.wikimedia.org/T285377 (10jbond) p:05Triage→03Medium [13:23:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:23:45] 10SRE, 10Wikimedia-Mailing-lists: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (10jbond) p:05Triage→03Medium [13:25:39] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for TChin - https://phabricator.wikimedia.org/T285326 (10jbond) p:05Triage→03Medium @MNadrofsky are you able to approve this request please. [13:29:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:29:40] !log uploaded python3-wmflib_0.0.8 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [13:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:49] (03CR) 10Kormat: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/701335 (https://phabricator.wikimedia.org/T284897) (owner: 10Kormat) [13:32:37] (03CR) 10Jbond: [C: 03+1] "lgtm" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [13:36:18] volans: \o/ [13:36:25] :) [13:38:15] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Ottomata) FYI I just wrote https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Accounts_and_passwords_explained%3A_LDAP%2FWikitech%2FMW_Developervs_shell%2Fssh%2Fposix_vs_K... [13:40:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install thumbor100[56] - https://phabricator.wikimedia.org/T273914 (10jijiki) [13:42:53] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Ottomata) > I have tried a number of username/password combinations with no luck. What is the error you get when you try? Can you also try logging out of Wikitech/ and then logg... 
[13:51:54] (03PS1) 10Krinkle: purgeParserCache.php: Print stats for time and iterations [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701285 (https://phabricator.wikimedia.org/T282761) [13:55:03] (03PS2) 10Krinkle: purgeParserCache.php: Print stats for time and iterations [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701285 [13:55:08] (03Abandoned) 10Krinkle: purgeParserCache.php: Print stats for time and iterations [core] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701285 (owner: 10Krinkle) [14:02:19] 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10cmooney) Thanks @faidon for the comments. In terms of why it is being discussed, I'm trying to advance tasks outstanding for WMCS (as discussed by myself and @joanna_borun), and the IPv6 stuff se... [14:04:47] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10LSobanski) Hi. Do we have an idea for when these hosts could be available? We have ongoing issues with parsercache (see T282761) that we hope moving to the new HW will partially mitigate. [14:10:53] (03PS5) 10Hnowlan: postgres::slave: remove problematic auto-replicate [puppet] - 10https://gerrit.wikimedia.org/r/700071 (https://phabricator.wikimedia.org/T285476) [14:11:23] (03CR) 10Hnowlan: "Thanks a lot for the fixes and help! Wouldn't have figured a lot of this out on my own." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/700071 (https://phabricator.wikimedia.org/T285476) (owner: 10Hnowlan) [14:12:29] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ChristineDeKock) > What is the error you get when you try? The error is "Invalid username or password". I get this error when I do "ssh -N stat1008.eqiad.wmnet -L 8880:127.0.0.1:... 
[14:13:08] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29985/console" [puppet] - 10https://gerrit.wikimedia.org/r/700071 (https://phabricator.wikimedia.org/T285476) (owner: 10Hnowlan) [14:15:12] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29986/console" [puppet] - 10https://gerrit.wikimedia.org/r/700071 (https://phabricator.wikimedia.org/T285476) (owner: 10Hnowlan) [14:18:18] (03CR) 10Jelto: add job to weekly rebuild production-images (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) (owner: 10Jelto) [14:32:52] (03CR) 1020after4: "changes tested in wmcs and puppet catalog compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/701206 (owner: 1020after4) [14:46:35] !log Disabling puppet on P{C:Postgresql::Slave} (netboxdb2001,puppetdb2002, most maps hosts) to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/700071 [14:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:52] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] postgres::slave: remove problematic auto-replicate [puppet] - 10https://gerrit.wikimedia.org/r/700071 (https://phabricator.wikimedia.org/T285476) (owner: 10Hnowlan) [14:57:40] !log installing libxml2 security updates on buster [14:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:19] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Ottomata) > This seems unnecessary since I can log into Wikitech just fine with the current password Agree, unnecessary if you are sure of the password. > Christine will need LD... 
[14:59:55] !log restarting mw canaries to pick up libxml2 security update [14:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:40] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Reedy) [15:02:21] !log reenabling puppet on P{C:Postgresql::Slave} [15:02:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1444.eqiad.wmnet ` The log can be found in `/var/log/wmf-au... [15:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:00] (03PS1) 10Elukey: knative,kubeflow: improve the import of the build images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/701396 [15:03:09] (03PS2) 10Ottomata: Add a consumers.analytics-hadoop setting to automate ingestion of streams into HDFS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668124 (https://phabricator.wikimedia.org/T273901) [15:04:23] (03PS3) 10Ottomata: Add a consumers.analytics-hadoop setting to automate ingestion of streams into HDFS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668124 (https://phabricator.wikimedia.org/T273901) [15:04:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10RobH) [15:04:36] (03PS4) 10Muehlenhoff: Add logout.d script for the IDP [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) [15:04:52] (03CR) 10Muehlenhoff: Add logout.d script for the IDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [15:06:38] (03CR) 10Jelto: "> Patch Set 1: Code-Review-1" [gitlab-ansible] - 
10https://gerrit.wikimedia.org/r/701068 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [15:10:01] (03CR) 10Muehlenhoff: [C: 03+2] Add logout.d script for the IDP [puppet] - 10https://gerrit.wikimedia.org/r/701350 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [15:14:25] (03PS6) 10Hnowlan: postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) [15:21:16] (03CR) 10Hnowlan: [C: 03+2] postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [15:24:13] (03Merged) 10jenkins-bot: postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [15:24:38] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10ChristineDeKock) Fantastic, thanks for your help. It works now. 
[15:26:31] !log installing ruby-websocket-extensions security updates [15:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:26] !log installing jackson-databind security updates [15:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:40] (03PS2) 10Hnowlan: maps: make maps2007 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/700087 (https://phabricator.wikimedia.org/T269582) [15:33:16] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:37:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1444.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1444.eqiad.wmnet'] ` [15:42:30] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps2007.codfw.wmnet with reason: depooling and reimaging as buster replica [15:42:30] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps2007.codfw.wmnet with reason: depooling and reimaging as buster replica [15:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:55] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2007.codfw.wmnet [15:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:23] !log running `nodetool decommission` on maps2007 [15:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:31] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Ottomata) 05Open→03Resolved [15:45:32] 
(03PS4) 10Ottomata: Add a consumers.analytics-hadoop setting to automate ingestion of streams into HDFS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668124 (https://phabricator.wikimedia.org/T273901) [15:49:00] 10SRE, 10MW-on-K8s, 10serviceops, 10User-jijiki: Enable TLS termination on the mwdebug deployment. fix the service definition in the chart - https://phabricator.wikimedia.org/T284421 (10jijiki) a:03jijiki [15:49:41] 10SRE, 10MW-on-K8s, 10serviceops, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10jijiki) [16:00:04] jbond42 and cdanis: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210624T1600). [16:00:04] twentyafterfour: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:25] * twentyafterfour here [16:00:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/701206 is a trivial config change. [16:01:03] cool, I'll merge shortly [16:01:25] (03PS3) 10Effie Mouzeli: tegola-vector-tiles: add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/701138 (https://phabricator.wikimedia.org/T283159) [16:02:14] (03CR) 10CDanis: [C: 03+2] Remove reference to obsolete phabricator libraries. [puppet] - 10https://gerrit.wikimedia.org/r/701206 (owner: 1020after4) [16:02:41] twentyafterfour: puppet-merge complete [16:02:46] I can run `puppet agent --test` and verify on the phabricator server [16:03:03] 👍 [16:03:10] thanks cdanis! 
[16:03:13] np [16:03:45] (03PS1) 10David Caro: cinderutils.ensure: take into account filesystem size [puppet] - 10https://gerrit.wikimedia.org/r/701405 [16:04:02] Error: Facter: error while resolving custom fact "postgres_replica_initialised": undefined method `match' for nil:NilClass [16:04:06] is that normal? [16:04:13] (03CR) 10jerkins-bot: [V: 04-1] cinderutils.ensure: take into account filesystem size [puppet] - 10https://gerrit.wikimedia.org/r/701405 (owner: 10David Caro) [16:04:23] (03CR) 10MSantos: Add blubber variant for tile pregeneration image (031 comment) [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/701373 (owner: 10Jgiannelos) [16:04:24] otherwise clean puppet run [16:10:02] twentyafterfour: oh, that's something I added earlier today - which host are you seeing that on? [16:10:36] (03PS2) 10David Caro: cinderutils.ensure: take into account filesystem size [puppet] - 10https://gerrit.wikimedia.org/r/701405 [16:11:02] oh, I see it. writing a fix now. [16:11:13] hnowlan: phab1001 [16:11:20] (03CR) 10Jgiannelos: Add blubber variant for tile pregeneration image (031 comment) [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/701373 (owner: 10Jgiannelos) [16:12:01] !log restarted php7.3-fpm on phab1001 [16:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cinderutils.ensure: take into account filesystem size [puppet] - 10https://gerrit.wikimedia.org/r/701405 (owner: 10David Caro) [16:21:17] (03CR) 10David Caro: [C: 03+2] cinderutils.ensure: take into account filesystem size [puppet] - 10https://gerrit.wikimedia.org/r/701405 (owner: 10David Caro) [16:27:44] (03PS1) 10Hnowlan: postgresql: don't get replica status if version is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/701428 [16:29:15] (03PS1) 10David Caro: cinderutils.ensure: use 11MB as filesystem space lost [puppet] - 10https://gerrit.wikimedia.org/r/701429 [16:30:41] 
(03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Math are hard." [puppet] - 10https://gerrit.wikimedia.org/r/701429 (owner: 10David Caro) [16:33:01] (03CR) 10David Caro: [C: 03+2] cinderutils.ensure: use 11MB as filesystem space lost [puppet] - 10https://gerrit.wikimedia.org/r/701429 (owner: 10David Caro) [16:34:10] (03PS1) 10Volans: idm: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701430 [16:49:17] (03PS2) 10Volans: idm: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701430 [16:51:44] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:52:47] 10SRE, 10Wikimedia-Mailing-lists: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486 (10Legoktm) What email address are you subscribed with? Feel free to email it to me if you don't want to post it publicly. If you know the address of the other person that wo... 
[16:53:04] (03PS2) 10Herron: kafka-logging: migrate logstash2001 broker to kafka-logging2001 [puppet] - 10https://gerrit.wikimedia.org/r/683012 (https://phabricator.wikimedia.org/T279342) [16:53:58] (03CR) 10Herron: [C: 03+2] kafka-logging: migrate logstash2001 broker to kafka-logging2001 [puppet] - 10https://gerrit.wikimedia.org/r/683012 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [16:55:11] jouncebot: now [16:55:12] For the next 0 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210624T1600) [16:56:15] volans: if you're still around, I could use help in understanding the test failure on https://integration.wikimedia.org/ci/job/tox-docker/19358/console [16:56:23] (03PS3) 10Volans: idm: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701430 [16:56:47] legoktm: sure, still around but not for too long [17:00:04] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210624T1700). [17:00:22] legoktm: so on line 216 we set the side effect of self.mocked_remote.query.return_value.run_sync [17:00:36] that means that each time run_sync is called you get the results from that list [17:00:46] first call 0, second call RemoteExecutionError [17:00:50] you added a third call in the middle [17:01:05] and mock knows only how to reply to 2 calls, not 3 [17:02:06] or more [17:02:11] checking the new code how many calls does [17:02:17] ahh, ok [17:02:20] (03CR) 10jerkins-bot: [V: 04-1] idm: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701430 (owner: 10Volans) [17:03:30] accraze: any services deployments happening during this window? 
i'm looking to re-roll group1 this morning following fixes from yesterday and wanted a little time for it to bake before rolling group2 during the normal train window [17:03:49] cc Amir1 ^ [17:03:59] volans: thanks, got it passing locally [17:04:04] addressing the rest of your CR now [17:04:04] ack [17:05:05] dduvall: nope, nothing on our end, you should be good to go! [17:05:12] right on. thanks! [17:08:01] !log re-rolling group1 to 1.37.0-wmf.11 (T281152) following deployment of blocker fixes (cc risky patch contacts Amir1 Krinkle DannyS712) [17:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:06] T281152: 1.37.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T281152 [17:09:07] (03PS1) 10Dduvall: group1 wikis to 1.37.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701437 [17:09:09] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.37.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701437 (owner: 10Dduvall) [17:10:08] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701437 (owner: 10Dduvall) [17:11:27] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.11 [17:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:33] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.11 (duration: 01m 06s) [17:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:25] AndyRussG: just saw that https://phabricator.wikimedia.org/T285449 was mentioned in the train block task but not added as a blocker. any concerns there for group2 promotion today? [17:15:38] (03PS1) 10Herron: add kafka-logging200[123] to kafka_brokers_logging [puppet] - 10https://gerrit.wikimedia.org/r/701440 (https://phabricator.wikimedia.org/T279342) [17:16:38] Amir1, Krinkle: thanks for the fixes and reviews yesterday. 
so far so good after group1 re-deploy, no frameCount related notices [17:17:55] volans: so if run_async() is actually async, do the sleeps in stop_periodic_jobs() actually work? [17:18:24] (03CR) 10Herron: [C: 03+2] add kafka-logging200[123] to kafka_brokers_logging [puppet] - 10https://gerrit.wikimedia.org/r/701440 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [17:18:50] legoktm: it's async in terms of different hosts, not commands, so each host will run all the commands one after the other, independently of what the other hosts are doing [17:18:59] with run_sync the first command is run everywhere, then the second, etc... [17:19:12] see https://wikitech.wikimedia.org/wiki/Cumin#Command_execution [17:19:51] gotcha [17:21:28] dduvall: hiii! Thanks, I guess I'd be pretty sure the answer is no? The only place I know of with errors is CI, though I guess I haven't checked the status of the beta cluster. The issue was a core change that CentralNotice wasn't ready for, so I imagine the block was just that the core change is expected on the train next week? [17:22:24] (03CR) 10Bartosz Dziewoński: [C: 04-1] "Blocked by T280599" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700675 (https://phabricator.wikimedia.org/T284339) (owner: 10Bartosz Dziewoński) [17:22:27] (03CR) 10Bartosz Dziewoński: [C: 04-1] "Blocked by T280599" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/700677 (https://phabricator.wikimedia.org/T285162) (owner: 10Bartosz Dziewoński) [17:24:04] AndyRussG: hmm, it's unclear to me. was it a change in master before the wmf/1.37.0-wmf.11 branch point or after? 
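Note on the test failure explained at 17:00:22 to 17:02:06: the behaviour volans describes is how Python's `unittest.mock` treats a list assigned to `side_effect`. Each call consumes the next item, an exception instance in the list is raised instead of returned, and a call beyond the end of the list raises `StopIteration`. The sketch below is a minimal illustration only. The `RemoteExecutionError` class is a local stand-in defined here, and the command strings are hypothetical; none of this is taken from the actual spicerack test.

```python
from unittest import mock


class RemoteExecutionError(Exception):
    """Local stand-in for the error type named in the chat (assumption)."""


def demo():
    # Two prepared replies, as in the chat: first call returns 0,
    # second call raises RemoteExecutionError.
    run_sync = mock.Mock(side_effect=[0, RemoteExecutionError("failed")])

    outcomes = []

    # First call: the first list item is returned as-is.
    outcomes.append(run_sync("first command"))

    # Second call: the item is an exception instance, so it is raised.
    try:
        run_sync("second command")
    except RemoteExecutionError:
        outcomes.append("raised")

    # Third call: the list is used up, so the mock raises StopIteration.
    try:
        run_sync("third command")
    except StopIteration:
        outcomes.append("exhausted")

    return outcomes
```

If the code under test gains an extra call, the usual remedy is to add a matching entry to the `side_effect` list at the position where the new call happens.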
[17:25:19] the comment is here https://phabricator.wikimedia.org/T281152#7174252 [17:25:41] 10SRE, 10Wikimedia-Mailing-lists: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486 (10CKoerner_WMF) My foundation staff account ckoerner@, the other user is kstinerowe@ [17:25:43] and the referenced patch is definitely present in wmf/1.37.0-wmf.11 since it merged on the 13th [17:25:43] (03PS4) 10Volans: idm: fix typo in docstring [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701430 [17:26:13] AndyRussG: do you mind taking a closer look before the train window? i don't want to break more things this week :) [17:26:32] (03PS4) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) [17:26:46] i already earned my t-shirt long ago, unless there's a new one... [17:26:51] (03PS5) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) [17:26:59] (03CR) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [17:27:03] dduvall: this is the core change task https://phabricator.wikimedia.org/T277728 [17:27:57] https://gerrit.wikimedia.org/r/673311 [17:28:13] right [17:28:58] Core change task has this tag: MW-1.37-notes (1.37.0-wmf.11; 2021-06-21) [17:29:46] When is the window? I'm not actually at the keyboard just now... Or is that sufficient? 
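Note on the run_sync / run_async exchange at 17:17:55 to 17:19:51: the ordering difference volans describes can be modelled in plain Python. The sketch below does not call Cumin at all. The function names, host names and command strings are hypothetical, and the threads only imitate "each host proceeds on its own". It is meant to show why a sleep placed between two commands still separates them on any given host in the async mode.

```python
import threading


def run_async_sim(hosts, commands):
    """Async-style model: every host walks the command list on its own."""
    events = []
    lock = threading.Lock()

    def worker(host):
        # Commands on one host are strictly sequential.
        for cmd in commands:
            with lock:
                events.append((host, cmd))

    threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


def run_sync_sim(hosts, commands):
    """Sync-style model: a command finishes on all hosts before the next starts."""
    events = []
    for cmd in commands:
        for host in hosts:
            events.append((host, cmd))
    return events


def per_host(events, host):
    """Return the commands one host ran, in the order it ran them."""
    return [cmd for h, cmd in events if h == host]


HOSTS = ["host-a", "host-b", "host-c"]             # hypothetical hosts
COMMANDS = ["stop-jobs", "sleep 5", "kill-jobs"]   # hypothetical commands
```

In both models `per_host(...)` returns the commands in their original order for every host. Only the sync model additionally guarantees that no host starts the second command before all hosts have finished the first.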
[17:30:02] window is in 1.5 hours [17:30:26] Unless that core change were going out on a backport, I think we're safe [17:31:03] I didn't add the train block so I'm only guessing why it was added [17:32:47] Aaa it'd be not ready for me to get to the computer before then [17:32:58] *not easy [17:33:30] AndyRussG: i'm a bit confused, because you're saying CN is currently not ready for that change, but that change is included in wmf/1.37.0-wmf.11 [17:33:36] https://www.irccloud.com/pastebin/fMcXiqQK/ [17:34:45] Krinkle, James_F: any guidance? since i see you authored/submitted [17:35:02] dduvall: ok hmmm I guess I'm confused too. I'll go figure it out though [17:35:14] AndyRussG: thank you! [17:35:30] AndyRussG: I don't believe that issue affects CN code beyond unit tests [17:35:41] did we determine otherwise? [17:36:06] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [17:36:24] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:40] Krinkle: ah that could be [17:36:46] AndyRussG: the core change is in wmf.11, and wmf.11 is half-deployed already. [17:37:21] Right, I was somehow thinking that was next week's version [17:37:37] (03PS6) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) [17:38:43] AndyRussG: I think the master patch to CN might be a problem though, but that's something we can fix before the next train. 
[17:38:48] left a CR comment [17:39:00] fwiw looks like the CN change was just CR+2d by elliot https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/701404 [17:39:01] Krinkle dduvall: so maybe the train block was just added because CI [17:39:04] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:39:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:42] alrighty! thanks for sussing it out [17:39:43] (03PS1) 10Volans: logoutd: add support for Python 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) [17:39:56] AndyRussG: Danny mentioned it on the task to ask you/me whether it affects CN, it wasn't set as a blocker (yet). [17:40:10] (03CR) 10jerkins-bot: [V: 04-1] logoutd: add support for Python 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) (owner: 10Volans) [17:40:12] If the Config object was missing in prod we would have seen an error by now. [17:41:00] got it. thanks Krinkle and AndyRussG [17:41:18] Krinkle: yes indeed [17:41:35] (03PS2) 10Volans: logoutd: add support for Python 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) [17:41:41] dduvall: thank u! 
[17:42:10] Krinkle: thanks much for the cr and clarification [17:42:53] (03CR) 10jerkins-bot: [V: 04-1] logoutd: add support for Python 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) (owner: 10Volans) [17:43:32] I'll fix the fix in today or tomorrow then Krinkle [17:43:40] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:44:00] PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:04] (aaargh phone keyboards....) [17:44:38] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:44:41] (03PS2) 10Herron: kafka-logging: migrate logstash2002 broker to kafka-logging2002 [puppet] - 10https://gerrit.wikimedia.org/r/683013 (https://phabricator.wikimedia.org/T279342) [17:46:40] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10wkandek) [17:50:15] (03PS7) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) [17:52:16] (03CR) 10Herron: [C: 03+2] kafka-logging: migrate logstash2002 broker to kafka-logging2002 [puppet] - 10https://gerrit.wikimedia.org/r/683013 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [17:52:44] (03PS3) 10Volans: logoutd: add support for Python 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) [17:55:00] (03CR) 10Volans: "Compiler seems happy: 
https://puppet-compiler.wmflabs.org/compiler1003/29992/" [puppet] - 10https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) (owner: 10Volans) [17:55:06] (03PS8) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) [17:57:13] (03PS3) 10Legoktm: sre.switchdc.mediawiki: Update for periodic job changes in spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) [17:57:39] (03CR) 10Legoktm: "PS7-8: restore the per-dc handling of all commands so that we can avoid running in the active DC during live test mode." [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [17:59:26] PROBLEM - Hadoop NodeManager on an-worker1130 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210624T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:01:08] RECOVERY - Hadoop NodeManager on an-worker1130 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:02:14] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:41] (03PS9) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) [18:04:03] (03CR) 10Legoktm: "PS9: Reorganized stuff after multiple rounds of refactoring to minimize the diff." [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [18:06:46] (03PS5) 10Ottomata: Add a consumers.analytics-hadoop setting to automate ingestion of streams into HDFS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668124 (https://phabricator.wikimedia.org/T273901) [18:09:34] (03CR) 10Volans: "quick question inline, LGTM otherwise" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [18:12:26] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:12:59] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:40] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:44] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:49] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:37] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10CDanis) a:05Papaul→03None Unfortunately I won't have time to work on this before going on leave, but it seems like it might not be a bad task for @cmooney to learn some ab... [18:20:49] (03CR) 10Legoktm: sre.switchdc.mediawiki: Update for periodic job changes in spicerack (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [18:21:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:48] (03CR) 10RLazarus: sre.switchdc.mediawiki: Update for periodic job changes in spicerack (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [18:22:22] (03PS2) 10Herron: kafka-logging: migrate logstash2003 broker to kafka-logging2003 [puppet] - 10https://gerrit.wikimedia.org/r/683014 (https://phabricator.wikimedia.org/T279342) [18:22:24] (03CR) 10RLazarus: [C: 03+1] "Haven't tested this yet, but pending the exercise today it looks good. Appreciate the clear interstitial comments in those pipelines!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [18:23:09] (03CR) 10RLazarus: [C: 03+1] mediawiki: Port mw-cli-wrapper to Python [puppet] - 10https://gerrit.wikimedia.org/r/701164 (owner: 10Legoktm) [18:23:57] (03CR) 10RLazarus: [C: 03+1] mediawiki: mw-cli-wrapper: Only run if read only in confctl is false (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701055 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [18:28:46] (03CR) 10Herron: [C: 03+2] kafka-logging: migrate logstash2003 broker to kafka-logging2003 [puppet] - 10https://gerrit.wikimedia.org/r/683014 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [18:31:30] 10SRE, 10Performance-Team, 10serviceops, 10MW-1.36-notes, and 3 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10Krinkle) ### Background 12 local tabs with: ` $ watch -n0 "curl 'https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page' -H '... [18:31:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Cmjohnson) Updated all the netbox port information, added the 2nd interface, and connected to cloud-storage vlan. Named interfac... [18:33:22] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:52:06] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) Haven't had an update in a while from them, I just pinged them again. 
[18:56:17] (03CR) 10Legoktm: mediawiki: mw-cli-wrapper: Only run if read only in confctl is false (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/701055 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [18:56:23] (03PS4) 10Legoktm: mediawiki: mw-cli-wrapper: Only run if read only in confctl is false [puppet] - 10https://gerrit.wikimedia.org/r/701055 (https://phabricator.wikimedia.org/T266717) [18:56:33] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] marxarelli and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210624T1900). [19:00:46] (03CR) 10Legoktm: [C: 03+2] mediawiki: Port mw-cli-wrapper to Python [puppet] - 10https://gerrit.wikimedia.org/r/701164 (owner: 10Legoktm) [19:00:49] (03CR) 10RLazarus: sre.switchdc.mediawiki: Update for periodic job changes in spicerack (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [19:02:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:57] (03CR) 10Legoktm: [C: 03+2] mediawiki: mw-cli-wrapper: Only run if read only in confctl is false [puppet] - 10https://gerrit.wikimedia.org/r/701055 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [19:03:04] (03PS5) 10Legoktm: mediawiki: mw-cli-wrapper: Only run if read only in confctl is false [puppet] - 10https://gerrit.wikimedia.org/r/701055 (https://phabricator.wikimedia.org/T266717) [19:04:16] !log preparing to roll group2 to 1.37.0-wmf.11 (T281152) (cc risky patch contacts Amir1 Krinkle DannyS712) [19:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:22] T281152: 
1.37.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T281152 [19:05:29] er, deploy-promote is confused about the current group [19:06:12] oh that's right. i have to give it an explicit 'all' [19:06:20] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.11 [19:06:21] * dduvall grumbles at our inconsistent tooling [19:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:31] ah, that's what I do. I thought you meant the group it was currently on [19:06:33] 10SRE, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) 05Open→03Resolved https://wikimedia.statuspage.io/ is live and will continue to see more development, usage and integration. Further efforts to be tracked in... [19:06:33] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_db_lag_stats_reporter.service,mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:35] sorry for the noise all. i'll re-run in a sec [19:07:25] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.11 (duration: 01m 04s) [19:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:38] jeena: yeah. seems weird that group1 is the only instance where you let the script figure it out on its own but i should stop complaining and change that :) [19:08:07] (03PS1) 10Dduvall: all wikis to 1.37.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701448 [19:08:09] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.37.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701448 (owner: 10Dduvall) [19:08:14] Error: Facter: error while resolving custom fact "postgres_replica_initialised": undefined method `match' for nil:NilClass [19:08:14] Did you mean? 
catch [19:08:22] I think that's a newer feature, which I've actually never used...I still put in the group I want to promote to since it makes me feel better :P [19:08:37] i see. i like that better [19:08:54] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701448 (owner: 10Dduvall) [19:09:09] i'll edit the docs to suggest an explicit usage [19:09:36] https://gerrit.wikimedia.org/r/c/operations/puppet/+/701428/ is the fix [19:10:35] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.11 [19:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:46] 10SRE, 10Wikimedia-Mailing-lists: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486 (10Legoktm) {F34525978} Well that certainly looks wrong... [19:21:56] (03CR) 10Volans: "One minor thing to check/fix in the tests and a question inline" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [19:21:57] 10SRE, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10Legoktm) >>! In T202061#7175872, @lmata wrote: > https://wikimedia.statuspage.io/ is live and will continue to see more development, usage and integration. Further effo... [19:26:34] dduvall: I just got home after long travel, I will check the train blocker or the logspam ASAP [19:27:11] Amir1: thanks! 
i haven't seen anything so far [19:27:52] I mean this one T285490 :D [19:27:52] T285490: InvalidArgumentException: Media handler BmpHandler returned NULL for metadata, should be array - https://phabricator.wikimedia.org/T285490 [19:28:23] ah, right :) [19:28:55] i didn't add it as a blocker since i've only seen one log entry since group1 re-deploy [19:29:18] (03PS1) 10Andrew Bogott: Add config file for the disable_tool script [puppet] - 10https://gerrit.wikimedia.org/r/701455 (https://phabricator.wikimedia.org/T170355) [19:30:00] yeah, it's just a very weird edge case [19:30:03] (03CR) 10jerkins-bot: [V: 04-1] Add config file for the disable_tool script [puppet] - 10https://gerrit.wikimedia.org/r/701455 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [19:31:38] (03PS2) 10Andrew Bogott: Add config file for the disable_tool script [puppet] - 10https://gerrit.wikimedia.org/r/701455 (https://phabricator.wikimedia.org/T170355) [19:32:14] (03CR) 10jerkins-bot: [V: 04-1] Add config file for the disable_tool script [puppet] - 10https://gerrit.wikimedia.org/r/701455 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [19:32:25] 10SRE, 10Wikimedia-Mailing-lists: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486 (10Ladsgroup) lol [19:34:41] (03PS10) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) [19:34:43] (03CR) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [19:35:33] (03PS3) 10Andrew Bogott: Add config file for the disable_tool script [puppet] - 10https://gerrit.wikimedia.org/r/701455 (https://phabricator.wikimedia.org/T170355) [19:35:43] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - 
OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:41:07] (03PS11) 10Legoktm: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - 10https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) [19:42:07] (03CR) 10Jbond: "LGTM, some none blocking comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/701428 (owner: 10Hnowlan) [19:42:12] (03CR) 10Jbond: [C: 03+1] postgresql: don't get replica status if version is unavailable [puppet] - 10https://gerrit.wikimedia.org/r/701428 (owner: 10Hnowlan) [19:42:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/701430 (owner: 10Volans) [19:43:02] (03PS4) 10Legoktm: sre.switchdc.mediawiki: Update for periodic job changes in spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) [19:43:16] (03CR) 10Legoktm: sre.switchdc.mediawiki: Update for periodic job changes in spicerack (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: 10Legoktm) [19:43:28] (03PS5) 10Legoktm: sre.switchdc.mediawiki: Update for periodic job changes in spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) [19:43:40] (03CR) 10Andrew Bogott: [C: 03+2] Add config file for the disable_tool script [puppet] - 10https://gerrit.wikimedia.org/r/701455 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [19:44:00] 10SRE, 10observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) 05Resolved→03Open > Is there documentation about this somewhere? E.g. what does "Editing" mean, which API is "API" about, what developer tools are being watch... 
[19:44:08] (CR) Legoktm: "PS11: Split out check_systemd_timers_enabled() method based on discussion in the cookbooks patch." [software/spicerack] - https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: Legoktm)
[19:48:18] (CR) Jbond: [C: +1] "LGTM but see Q" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) (owner: Volans)
[19:50:07] (PS1) Andrew Bogott: profile::wmcs::nfs::primary: Fix exchanged ldap pass and dn [puppet] - https://gerrit.wikimedia.org/r/701458 (https://phabricator.wikimedia.org/T170355)
[19:53:44] (CR) Andrew Bogott: [C: +2] profile::wmcs::nfs::primary: Fix exchanged ldap pass and dn [puppet] - https://gerrit.wikimedia.org/r/701458 (https://phabricator.wikimedia.org/T170355) (owner: Andrew Bogott)
[19:53:51] (CR) Volans: [C: +2] idm: fix typo in docstring [software/pywmflib] - https://gerrit.wikimedia.org/r/701430 (owner: Volans)
[19:54:17] (PS3) Bartosz Dziewoński: Remove redundant wgDiscussionToolsEnable overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/674300 (owner: Esanders)
[19:54:23] (PS4) Bartosz Dziewoński: Enable DiscussionTools' topicsubscription as beta feature on partner wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698622 (https://phabricator.wikimedia.org/T274280)
[19:56:13] (PS1) Andrew Bogott: profile::wmcs::nfs::primary: more c/p fixes [puppet] - https://gerrit.wikimedia.org/r/701459
[19:57:26] (Merged) jenkins-bot: idm: fix typo in docstring [software/pywmflib] - https://gerrit.wikimedia.org/r/701430 (owner: Volans)
[19:57:47] (CR) Volans: "replied to questions" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/701442 (https://phabricator.wikimedia.org/T283242) (owner: Volans)
[19:58:28] (CR) Andrew Bogott: [C: +2] profile::wmcs::nfs::primary: more c/p fixes [puppet] - https://gerrit.wikimedia.org/r/701459 (owner:
Andrew Bogott)
[20:01:34] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: Legoktm)
[20:01:55] (CR) Volans: [C: +1] "LGTM, let's test it!" [software/spicerack] - https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: Legoktm)
[20:03:00] (CR) Legoktm: [C: +2] mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: Legoktm)
[20:06:32] (CR) RLazarus: [C: +1] sre.switchdc.mediawiki: Update for periodic job changes in spicerack [cookbooks] - https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: Legoktm)
[20:09:13] (PS3) Legoktm: sre.switchdc.mediawiki: Warm up caches in api_appserver cluster too [cookbooks] - https://gerrit.wikimedia.org/r/700704 (https://phabricator.wikimedia.org/T269179)
[20:09:29] (CR) Legoktm: "> Patch Set 2:" [cookbooks] - https://gerrit.wikimedia.org/r/700704 (https://phabricator.wikimedia.org/T269179) (owner: Legoktm)
[20:09:53] (Merged) jenkins-bot: mediawiki: Update cronjob code now that most are systemd timers [software/spicerack] - https://gerrit.wikimedia.org/r/701053 (https://phabricator.wikimedia.org/T266717) (owner: Legoktm)
[20:11:18] (PS2) Cwhite: logstash: transition openstack to ECS [puppet] - https://gerrit.wikimedia.org/r/699039 (https://phabricator.wikimedia.org/T234565)
[20:13:59] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (Andrew) ens3f0np0 and ens3f1np1 look right to me, although I won't know for sure until we see what debian calls them.
[20:14:35] (CR) Cwhite: [C: +2] logstash: transition openstack to ECS [puppet] - https://gerrit.wikimedia.org/r/699039 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[20:24:39] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.55 [software/spicerack] - https://gerrit.wikimedia.org/r/701460
[20:28:14] !log re-enabled daily digests for wikimedia-l - T285486
[20:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:20] T285486: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486
[20:28:35] SRE, Wikimedia-Mailing-lists: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486 (Legoktm) > The last digest email I have received is from June 18 ` MariaDB [mailman3]> select list_id, digests_enabled, digest_last_sent_at from mailinglist where list_id...
[20:30:40] (CR) Volans: [V: +2 C: +2] CHANGELOG: add changelogs for release v0.0.55 [software/spicerack] - https://gerrit.wikimedia.org/r/701460 (owner: Volans)
[20:31:29] SRE, Wikimedia-Mailing-lists: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486 (Legoktm) Unfortunately you're not going to receive copies of the missed messages while digests were disabled, you'll need to use the web archive to catch up. And I think...
[20:33:00] (PS1) Volans: Upstream release v0.0.55 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/701461
[20:39:25] (CR) Volans: [C: +2] Upstream release v0.0.55 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/701461 (owner: Volans)
[20:40:39] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert.
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[20:44:02] !log uploaded spicerack_0.0.55 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia
[20:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:25] SRE, Wikimedia-Mailing-lists: Wikimedia-l Digests no longer received as of June 18, 2021 - https://phabricator.wikimedia.org/T285486 (Ijon) I... don't! Thanks for looking into this. I //did// make some list config updates about a week ago (e.g. added a requested link to the archives to the footer), and,...
[20:46:57] (CR) Legoktm: [C: +2] sre.switchdc.mediawiki: Update for periodic job changes in spicerack [cookbooks] - https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: Legoktm)
[20:49:52] (Merged) jenkins-bot: sre.switchdc.mediawiki: Update for periodic job changes in spicerack [cookbooks] - https://gerrit.wikimedia.org/r/701219 (https://phabricator.wikimedia.org/T266717) (owner: Legoktm)
[20:53:32] !log legoktm@phab1001:~$ sudo /srv/phab/phabricator/bin/remove destroy M320 (spam)
[20:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:52] (PS1) Krinkle: purgeParserCache.php: Implement --tag for purging one server only [core] (wmf/1.37.0-wmf.11) - https://gerrit.wikimedia.org/r/701408 (https://phabricator.wikimedia.org/T282761)
[20:58:51] !log starting dry run and live test of DC switchover
[20:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:54] (CR) Krinkle: "Scap trap: Sync extending class (SqlBag) before the rest in libs/objectcache. https://3v4l.org/iCSgP#focus=7.2.22. Then maintenance/ last."
[core] (wmf/1.37.0-wmf.11) - https://gerrit.wikimedia.org/r/701408 (https://phabricator.wikimedia.org/T282761) (owner: Krinkle)
[21:10:43] SRE, observability: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (CDanis) >>! In T202061#7175886, @Legoktm wrote: >>>! In T202061#7175872, @lmata wrote: >> https://wikimedia.statuspage.io/ is live and will continue to see more develop...
[21:47:33] !log live hacked spicerack on cumin1001 to revert https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/700963/
[21:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:33] (PS4) Legoktm: sre.switchdc.mediawiki: Warm up caches in api_appserver cluster too [cookbooks] - https://gerrit.wikimedia.org/r/700704 (https://phabricator.wikimedia.org/T269179)
[21:51:42] (CR) Legoktm: [C: +2] sre.switchdc.mediawiki: Warm up caches in api_appserver cluster too [cookbooks] - https://gerrit.wikimedia.org/r/700704 (https://phabricator.wikimedia.org/T269179) (owner: Legoktm)
[21:54:40] (Merged) jenkins-bot: sre.switchdc.mediawiki: Warm up caches in api_appserver cluster too [cookbooks] - https://gerrit.wikimedia.org/r/700704 (https://phabricator.wikimedia.org/T269179) (owner: Legoktm)
[21:54:50] (PS3) Krinkle: InitialiseSettings: Change wgEntitySchemaShExSimpleUrl to toolforge.org [mediawiki-config] - https://gerrit.wikimedia.org/r/701093 (https://phabricator.wikimedia.org/T285364)
[21:59:00] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet
[21:59:02] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0)
[21:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:25] okay then
[22:00:11] we're testing the switchover in live test mode, so it's going from
codfw -> eqiad
[22:00:57] and skipping potentially bad things for eqiad or running them against codfw (like cache warmup)
[22:00:58] and will ! log to SAL creating some "spam"
[22:01:43] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet
[22:01:46] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0)
[22:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:00] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl
[22:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:23] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0)
[22:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:00] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-warmup-caches
[22:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:45] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0)
[22:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:29] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance
[22:09:31] !log legoktm@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99)
[22:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:17] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting.
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[22:27:37] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[22:29:25] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly
[22:29:26] !log legoktm@cumin1001 [DRY-RUN] MediaWiki read-only period starts at: 2021-06-24 22:29:25.643909
[22:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:36] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0)
[22:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:50] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly
[22:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:22] !log legoktm@cumin1001 END (ERROR) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=97)
[22:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:58] PROBLEM - MariaDB read only x2 #page on db2142 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.15-MariaDB-log, Uptime 10569802s, event_scheduler: True, 16.55 QPS, connection latency: 0.004240s, query latency: 0.000567s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[22:34:46] ^ caused by switchover testing, context in -sre
[22:35:44] volans is setting read_only back off
[22:35:54] RECOVERY - MariaDB read only x2 #page on db2142 is OK: Version 10.4.15-MariaDB-log, Uptime 10569919s, read_only: False, event_scheduler: True, 30.21 QPS, connection latency: 0.003913s, query latency: 0.000564s
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[22:36:07] 📝
[22:36:09] !log set x2 codfw master back to RW
[22:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:05] PROBLEM - Host ping3001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:45:23] RECOVERY - Host ping3001 is UP: PING OK - Packet loss = 0%, RTA = 107.11 ms
[22:55:05] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl
[22:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:32] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0)
[22:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:41] !log legoktm@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance
[22:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:25] !log legoktm@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0)
[22:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:22] I think we're done with switch testing for today
[23:00:04] brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for US Backport and Config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210624T2300).
[23:02:56] !log reverted cumin1001 spicerack live hacks
[23:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:59] SRE: Connecting to https://api.svc.codfw.wmnet/ does not work - https://phabricator.wikimedia.org/T285517 (Legoktm)
[23:18:34] (PS1) Legoktm: Revert "mediawiki: Make siteinfo API request over HTTPS" [software/spicerack] - https://gerrit.wikimedia.org/r/701409 (https://phabricator.wikimedia.org/T285517)
[23:18:41] (PS2) Legoktm: Revert "mediawiki: Make siteinfo API request over HTTPS" [software/spicerack] - https://gerrit.wikimedia.org/r/701409 (https://phabricator.wikimedia.org/T285517)
[23:18:46] (CR) Legoktm: [C: +2] Revert "mediawiki: Make siteinfo API request over HTTPS" [software/spicerack] - https://gerrit.wikimedia.org/r/701409 (https://phabricator.wikimedia.org/T285517) (owner: Legoktm)
[23:24:16] SRE, serviceops, Datacenter-Switchover: Siteinfo timeout during switch datacenter - https://phabricator.wikimedia.org/T266618 (Legoktm)
[23:24:18] SRE, Patch-For-Review: Connecting to https://api.svc.codfw.wmnet/ does not work - https://phabricator.wikimedia.org/T285517 (Legoktm)
[23:24:23] (Merged) jenkins-bot: Revert "mediawiki: Make siteinfo API request over HTTPS" [software/spicerack] - https://gerrit.wikimedia.org/r/701409 (https://phabricator.wikimedia.org/T285517) (owner: Legoktm)
[23:24:39] SRE, serviceops, Datacenter-Switchover: Siteinfo timeout during switch datacenter - https://phabricator.wikimedia.org/T266618 (Legoktm) >>! In T266618#7170274, @Legoktm wrote: >> it's connecting to port 80 with the x-forwarded-proto header, and that should probably be updated. > > This is easy, I'll...
[23:26:30] SRE, serviceops, Datacenter-Switchover: Various services hardcode api.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T285518 (Legoktm) p:Triage→High
[23:33:45] SRE, DBA, Datacenter-Switchover: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Legoktm) p:Triage→High
[23:38:43] SRE, Datacenter-Switchover: Hide `systemctl is-enabled` output in switchover cookbooks - https://phabricator.wikimedia.org/T285520 (Legoktm) p:Triage→Low
[23:44:41] SRE, Datacenter-Switchover: --live-test mode of switchdc cookbook should auto downtime "High average GET latency" alerts - https://phabricator.wikimedia.org/T285521 (Legoktm) p:Triage→Low
[23:49:41] SRE, SRE-tools, Spicerack, Datacenter-Switchover: switchdc: systemctl disable command failed, because units were already gone - https://phabricator.wikimedia.org/T285524 (Legoktm) p:Triage→High
[23:50:42] (CR) Legoktm: "If the new message looks good, I'd like to merge + deploy this tomorrow." [puppet] - https://gerrit.wikimedia.org/r/701052 (https://phabricator.wikimedia.org/T285373) (owner: Legoktm)
[23:55:02] (PS1) Tim Starling: Update src/defines.php [mediawiki-config] - https://gerrit.wikimedia.org/r/701467
[23:56:22] (CR) jerkins-bot: [V: -1] Update src/defines.php [mediawiki-config] - https://gerrit.wikimedia.org/r/701467 (owner: Tim Starling)
[23:56:43] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook