[00:06:38] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 63 probes of 660 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:42:14] RECOVERY - Check systemd state on apifeatureusage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:00] RECOVERY - SSH on wtp1027.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:48:52] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:50] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 90%, RTA = 6047.97 ms [01:02:50] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:09:12] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.90 ms [01:09:12] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 231.23 ms [01:40:30] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [02:00:05] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts.yaml: move wikidumpparse to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/762133 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [02:10:10] (03PS1) 10Andrew Bogott: nfs-mounts.yaml: move wikilink to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/762137 (https://phabricator.wikimedia.org/T301280) [02:18:18] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet 
loss = 100% [02:22:07] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [02:25:00] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts.yaml: move wikilink to a project-local nfs server [puppet] - 10https://gerrit.wikimedia.org/r/762137 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [03:58:40] (03PS1) 104nn1l2: InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) [04:00:06] (03PS2) 104nn1l2: InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) [04:00:52] (03CR) 10jerkins-bot: [V: 04-1] InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [04:08:47] (03PS3) 104nn1l2: InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) [04:17:26] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, and 2 others: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10lmata) 05Open→03Resolved About a year ago, the #sre_observability team sent out to mitigate risk fo... 
[04:19:07] (03PS10) 104nn1l2: Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [04:20:48] (03CR) 104nn1l2: Enable ULS webfonts by default on trwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [04:54:49] (03CR) 10Santhosh: [C: 03+1] Enable SectionTranslation in Occitan and Luganda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761626 (https://phabricator.wikimedia.org/T301443) (owner: 10KartikMistry) [04:55:11] (03CR) 10Santhosh: [C: 03+1] Fixed typo for SectionTranslation in testwiki: lu -> lg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761937 (owner: 10KartikMistry) [05:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:53:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [05:53:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [05:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:14] !log Deploy schema change on s5 master (db1130) T300775 [05:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:18] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [05:58:05] (03PS1) 10Marostegui: db1183: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/762145 (https://phabricator.wikimedia.org/T301219) [06:02:15] (03CR) 10Marostegui: [C: 03+2] db1183: 
Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/762145 (https://phabricator.wikimedia.org/T301219) (owner: 10Marostegui) [06:03:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [06:03:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [06:03:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [06:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [06:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:44] (03PS1) 10Marostegui: mariadb: Promote db1183 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/762146 (https://phabricator.wikimedia.org/T301219) [06:07:52] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/762146 (https://phabricator.wikimedia.org/T301219) (owner: 10Marostegui) [06:22:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [06:22:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [06:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T300382)', diff saved to https://phabricator.wikimedia.org/P20629 and previous config saved to /var/cache/conftool/dbconfig/20220214-062219-marostegui.json [06:22:21] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:25] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [06:32:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T300382)', diff saved to https://phabricator.wikimedia.org/P20630 and previous config saved to /var/cache/conftool/dbconfig/20220214-063204-marostegui.json [06:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:10] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [06:47:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P20631 and previous config saved to /var/cache/conftool/dbconfig/20220214-064709-marostegui.json [06:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:18] RECOVERY - SSH on wtp1027.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:02:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P20632 and previous config saved to /var/cache/conftool/dbconfig/20220214-070214-marostegui.json [07:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T300382)', diff saved to https://phabricator.wikimedia.org/P20633 and previous config saved to /var/cache/conftool/dbconfig/20220214-071718-marostegui.json [07:17:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [07:17:22] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [07:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:24] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [07:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:26:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:10] (03Abandoned) 10Marostegui: mariadb: new dbproxy role & profile [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) (owner: 10Banyek) [07:33:44] (03Abandoned) 10Marostegui: mariadb: Add tokudb support for analytics eventlogging nodes [puppet] - 10https://gerrit.wikimedia.org/r/356648 (owner: 10Jcrespo) [07:35:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [07:35:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [07:35:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T300382)', diff saved to https://phabricator.wikimedia.org/P20634 and previous config saved to /var/cache/conftool/dbconfig/20220214-073544-marostegui.json [07:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:51] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [07:37:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 (owner: 10Giuseppe Lavagetto) [07:40:56] (03Merged) 10jenkins-bot: Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 (owner: 10Giuseppe Lavagetto) [07:43:41] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2007.codfw.wmnet with OS bullseye [07:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300382)', diff saved to https://phabricator.wikimedia.org/P20635 and previous config saved to /var/cache/conftool/dbconfig/20220214-074529-marostegui.json [07:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:34] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [07:47:41] (03CR) 10Awight: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/762392 (owner: 10Awight) [07:48:22] !log installing expat security updates [07:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:52] (03PS1) 10Elukey: icinga: remove ORES unused alerts and improve the remaining ones [puppet] - 10https://gerrit.wikimedia.org/r/762393 [07:52:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33769/console" [puppet] - 10https://gerrit.wikimedia.org/r/762393 (owner: 10Elukey) [07:56:42] !log restart blazegraph on wdqs1013 (jvm stuck for 26h) [07:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:15] (03CR) 10Elukey: [V: 03+1] "elukey@alert1001:~$ sudo /usr/local/lib/nagios/plugins/check_ores_workers ores.wikimedia.org ores.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/762393 (owner: 10Elukey) [08:00:04] Amir1, awight, and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220214T0800). [08:00:04] Ideophagous, kart_, and nn1l2: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:27] welcome to the morning window [08:00:29] * kart_ is here [08:00:32] hello [08:00:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P20636 and previous config saved to /var/cache/conftool/dbconfig/20220214-080034-marostegui.json [08:00:36] I can deploy today [08:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:49] I'm also around [08:00:53] (totally forgot the new timing) [08:01:17] is it possible to deploy this patch too? 
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/761878 [08:01:18] urbanecm: cool. Thanks! [08:01:50] (03PS4) 10Urbanecm: arywiki: Add Portal and Draft namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761708 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [08:02:03] taavi: me too. But jouncebot is here to remind us :) [08:02:55] Ideophagous: the core patch needs to be reviewed and merged first, and then it will be automatically deployed with the next train (roughly, it means "next Thursday" for Wikipedias) [08:03:20] (03CR) 10Thiemo Kreuz (WMDE): "Personally, I would do this the other way around: Remove the hard-coded Wikimedia-specific default from the extension's extension.json (re" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762392 (owner: 10Awight) [08:03:36] (03CR) 10Urbanecm: [C: 03+2] arywiki: Add Portal and Draft namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761708 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [08:03:38] urbanecm: so I don't have to schedule it in this case? [08:03:48] Ideophagous: nope, just wait for someone to review it 🙂 [08:03:53] OK [08:04:17] (I can do it later, but now I need to focus on the window :-)) [08:04:20] (03Merged) 10jenkins-bot: arywiki: Add Portal and Draft namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761708 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [08:04:35] urbanecm: Sure thing, thanks! [08:06:04] Ideophagous: your patch is pulled to mwdebug1001. Can you verify it works, please? [08:06:35] how should I verify? [08:06:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:07:00] You need to have the browser extension installed (https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_usage). Click its icon, enable it, ensure mwdebug1001.eqiad.wmnet is the server selected and go to ary.wikipedia.org. 
Ensure the namespaces you created exist. [08:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:07] Ideophagous: see above [08:07:27] OK [08:08:44] one of the ways to verify namespace existence is going to Special:AllPages, and checking the dropdown [08:09:22] is there a specific location where I should check the namespaces? I don't see Draft or Portal on the list when I try to transfer a page or in the advanced search options? [08:09:26] (03CR) 10Ayounsi: [C: 03+1] Change elukey's ssh public key [homer/public] - 10https://gerrit.wikimedia.org/r/761281 (owner: 10Elukey) [08:09:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:09:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:34] PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 9.66e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:38] OK [08:09:53] Ideophagous: AFAIK, the search interface caches this, so it will not update immediately [08:10:10] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: remove ORES unused alerts and improve the remaining ones [puppet] - 10https://gerrit.wikimedia.org/r/762393 (owner: 10Elukey) [08:10:29] godog: <3 [08:10:37] (03CR) 10Elukey: [V: 03+1 C: 03+2] icinga: remove ORES unused alerts and improve the remaining ones [puppet] - 10https://gerrit.wikimedia.org/r/762393 (owner: 10Elukey) [08:10:46] Ideophagous: although I do see it in the search dropdown too. Are you sure the mwdebug extension is enabled? 
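For readers following the verification steps above without the browser extension, the same check can be sketched as a script. This is only an illustration, not part of the deployment process discussed in the channel: it assumes the `X-Wikimedia-Debug` header accepts a `backend=<host>` value (as described on the wikitech page linked above) and that the standard MediaWiki `siteinfo` API lists namespaces under `query.namespaces`. The function names and the User-Agent string are hypothetical, and the namespace names on arywiki may be localized rather than literally "Portal" and "Draft".

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the debug backend named in the chat; the header value format
# "backend=<host>" is taken from the X-Wikimedia-Debug documentation.
DEBUG_BACKEND = "mwdebug1001.eqiad.wmnet"


def build_siteinfo_request(wiki_host, backend=DEBUG_BACKEND):
    """Build (but do not send) a siteinfo request routed to a debug backend."""
    query = urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "format": "json",
    })
    return Request(
        f"https://{wiki_host}/w/api.php?{query}",
        headers={
            "X-Wikimedia-Debug": f"backend={backend}",
            "User-Agent": "namespace-check-sketch/0.1",  # hypothetical UA
        },
    )


def namespace_names(siteinfo_json):
    """Collect every canonical and local namespace name from a siteinfo reply."""
    namespaces = json.loads(siteinfo_json)["query"]["namespaces"]
    names = set()
    for info in namespaces.values():
        # "*" holds the local name in the default JSON format; "canonical"
        # is absent for the main namespace.
        for key in ("canonical", "*"):
            if info.get(key):
                names.add(info[key])
    return names


def missing_namespaces(siteinfo_json, expected):
    """Return the expected namespace names that the wiki does not report."""
    return sorted(set(expected) - namespace_names(siteinfo_json))
```

To run it against a live wiki, pass the request to `urllib.request.urlopen` and feed the response body to `missing_namespaces`; an empty list means the backend already knows both namespaces.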
[08:10:57] it should look like this https://usercontent.irccloud-cdn.com/file/qexRZ0Bp/image.png [08:10:58] yes, I enabled it [08:11:06] elukey: hehe LGTM! less code is always better [08:11:13] (03CR) 10Ayounsi: "Indeed, not supported on older Junos and we're a few years from upgrading/refreshing all our switch stacks." [homer/public] - 10https://gerrit.wikimedia.org/r/760937 (owner: 10Elukey) [08:11:27] urbanecm: should I enable all options? I only enabled GUI and verbose [08:12:05] Ideophagous: you don't need to enable any of those (they're for advanced use-cases). See my screenshot at https://usercontent.irccloud-cdn.com/file/qexRZ0Bp/image.png for how it should look like. [08:12:07] Oh, I see it now [08:12:31] All of the options can (and should) be disabled. But the on/off slider needs to be on 🙂 [08:12:38] (and the right server needs to be picked) [08:13:06] urbanecm: yep, it works. Thanks! When should we expect the change in the live version? [08:13:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:13:27] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2007.codfw.wmnet with OS bullseye [08:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:32] Ideophagous: momentarily :-). Since you verified it works (and it works fine when I try it as well), I'll sync it to production now [08:13:43] OK, great! [08:15:17] kart_: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/761937 is marked as depending on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/761626/2, an undeployed (and unscheduled) patch. 
[08:15:24] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: db0e71e8e69c81b766890c26b83d071b535ca871: arywiki: Add Portal and Draft namespaces (T291737) (duration: 00m 52s) [08:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:27] Ideophagous: it's live now [08:15:29] T291737: Request adding and updating namespaces on arywiki - https://phabricator.wikimedia.org/T291737 [08:15:39] kart_: should we deploy both, or rebase so https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/761937 doesn't depend on anything? [08:15:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P20637 and previous config saved to /var/cache/conftool/dbconfig/20220214-081538-marostegui.json [08:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:57] urbanecm: rebase and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/761937 only please. [08:17:30] kart_: no problem. In that case, can you do the rebase (while I deploy a different patch)? Gerrit says rebase will conflict [08:17:38] (03PS11) 10Urbanecm: Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [08:17:49] Sure [08:18:14] actually, nn1l2 is not here it looks? [08:18:15] urbanecm: thank you! There's a problem though: the Portal pages we created before are now empty. Especially one portal we spent a lot of time working on, we would definitely not want to lose it. [08:18:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:18:31] Ideophagous: yes, that'll be fixed soon. 
I need to run a specific script to fix that [08:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:36] thanks for reminding me to do it [08:18:44] OK, thanks again! [08:18:53] urbanecm: I added a last-minute patch too [08:19:05] taavi: ack. Self-service at the end is ok? :-) [08:19:11] yes please [08:19:22] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=arywiki --fix # T291737 [08:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:26] taavi: will ping you when finished then [08:19:35] Ideophagous: should be all fixed now. [08:19:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:19:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:06] (03CR) 10Urbanecm: prod: WRITE_NEW for CentralAuth hidden level migration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761599 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [08:21:12] urbanecm: seems stuck in rebase hell. Give me few more minutes. [08:21:39] kart_: absolutely, take your time. 
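The "empty Portal pages" exchange above is easier to follow with a small model of title resolution. The sketch below is not how `namespaceDupes.php` is implemented; it only illustrates the condition that script is run to repair. Before the config change, a page titled "Portal:History" is stored in the main namespace under that full title; once "Portal" becomes a real namespace, the same link resolves to a different (namespace, title) pair that has no page yet, so the content appears to be gone. The namespace IDs (100, 118), the page titles, and the function names are all hypothetical.

```python
MAIN_NAMESPACE = 0


def resolve_title(full_title, namespaces):
    """Split 'Prefix:Rest' into (namespace_id, title) if Prefix is a namespace."""
    prefix, sep, rest = full_title.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix], rest
    return MAIN_NAMESPACE, full_title


def find_shadowed_pages(stored_pages, namespaces):
    """List main-namespace rows whose title now resolves into another namespace.

    stored_pages is a list of (namespace_id, title) rows as they were saved
    before the new namespaces existed. Each result pairs the stored row with
    the location a lookup of the same title now points at.
    """
    shadowed = []
    for ns_id, title in stored_pages:
        if ns_id != MAIN_NAMESPACE:
            continue
        target = resolve_title(title, namespaces)
        if target != (ns_id, title):
            shadowed.append(((ns_id, title), target))
    return shadowed


# Hypothetical data: namespace IDs and titles are illustrative only.
namespaces_after = {"Talk": 1, "Portal": 100, "Draft": 118}
stored = [(0, "Portal:History"), (0, "Casablanca"), (0, "Draft:Foo")]
moves_needed = find_shadowed_pages(stored, namespaces_after)
```

In this model, each entry in `moves_needed` is a page that would have to be moved to its new location, which is the kind of cleanup the logged `--fix` run is meant to perform.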
[08:22:05] (03PS2) 10Majavah: prod: WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761599 (https://phabricator.wikimedia.org/T289068) [08:22:16] (03CR) 10Majavah: prod: WRITE_NEW for CentralAuth hidden level migration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761599 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [08:22:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:54] (03CR) 10Ayounsi: "I'd suggest adding 1 host from each rack, as well as 1 mgmt host." [puppet] - 10https://gerrit.wikimedia.org/r/760616 (https://phabricator.wikimedia.org/T282788) (owner: 10BBlack) [08:23:30] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:23:50] (03CR) 10Ayounsi: [C: 03+1] Add netflow6001 to kafka custom ferm [puppet] - 10https://gerrit.wikimedia.org/r/760613 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [08:25:12] (03CR) 10Ayounsi: [C: 03+1] "Good catch, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/760617 (https://phabricator.wikimedia.org/T186650) (owner: 10BBlack) [08:26:17] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10MatthewVernon) @cmooney thanks for that - if it's next week or so, I'm happy to wait. [background: these 4 hosts are h/w refresh for swift frontends. So I c... [08:28:11] (03PS2) 10KartikMistry: Fixed typo for SectionTranslation in testwiki: lu -> lg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761937 [08:28:47] urbanecm: done now :) [08:28:55] kart_: let's deploy it! 
[08:29:21] (03CR) 10Urbanecm: [C: 03+2] Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [08:29:25] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2008.codfw.wmnet with OS bullseye [08:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:34] wrong patch [08:29:45] (03CR) 10Urbanecm: [C: 03+2] Fixed typo for SectionTranslation in testwiki: lu -> lg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761937 (owner: 10KartikMistry) [08:29:54] Oops! :) [08:30:01] cancelled in time, fortunately [08:30:09] I'll ping you when it's at mwdebug for testing [08:30:25] Sure [08:30:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300382)', diff saved to https://phabricator.wikimedia.org/P20638 and previous config saved to /var/cache/conftool/dbconfig/20220214-083043-marostegui.json [08:30:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:30:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:30:47] (03Merged) 10jenkins-bot: Fixed typo for SectionTranslation in testwiki: lu -> lg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761937 (owner: 10KartikMistry) [08:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:49] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [08:30:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T300382)', diff saved to https://phabricator.wikimedia.org/P20639 and previous config saved to /var/cache/conftool/dbconfig/20220214-083051-marostegui.json [08:30:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:47] urbanecm: question, why didn't the article count decrease as a result of the creation of the Portal namespace? I was expecting that to happen, since the portals were counted as articles originally. [08:32:35] kart_: pulled to mwdebug1001, can you check? [08:32:47] Ideophagous: because that's updated in a background job that didn't have time to run so far 🙂 [08:32:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:29] Oh, OK. Thanks again! [08:33:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:33:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:24] kart_: how's testing going? 🙂 [08:36:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:00] urbanecm: a minute. [08:37:04] sure :) [08:38:52] urbanecm: all ok! [08:38:58] syncing! [08:39:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1b0daef196851bec1449aa1cdf4b111e6284c4e7: Fixed typo for SectionTranslation in testwiki: lu -> lg (duration: 00m 48s) [08:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:55] kart_: done! 
[08:40:02] !log UTC morning B&C window done [08:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:05] not yet [08:40:20] taavi: sorry. Floor is yours, of course [08:40:27] (03PS3) 10Majavah: prod: WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761599 (https://phabricator.wikimedia.org/T289068) [08:40:28] !log Reopen UTC morning B&C for a last deploy [08:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:32] (03CR) 10Majavah: [C: 03+2] prod: WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761599 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [08:40:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300382)', diff saved to https://phabricator.wikimedia.org/P20640 and previous config saved to /var/cache/conftool/dbconfig/20220214-084041-marostegui.json [08:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:46] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [08:41:35] (03Merged) 10jenkins-bot: prod: WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761599 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [08:44:59] syncing [08:45:44] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:761599|prod: WRITE_NEW for CentralAuth hidden level migration (T289068)]] (duration: 00m 49s) [08:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:49] T289068: Normalise centralauth.gu_hidden - https://phabricator.wikimedia.org/T289068 [08:46:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:46:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:48:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:10] !log UTC morning deploys done (for real this time) [08:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:09] 10SRE-swift-storage, 10Observability-Logging, 10User-fgiunchedi: Missed swift log rotation can lead to full root filesystem - https://phabricator.wikimedia.org/T301657 (10fgiunchedi) [08:49:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:33] (03CR) 10Filippo Giunchedi: [C: 03+1] Add ops-drmrs to alertmanager config [puppet] - 10https://gerrit.wikimedia.org/r/760614 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [08:52:42] urbanecm: Forgot one thing. Thanks! 
:) [08:55:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P20641 and previous config saved to /var/cache/conftool/dbconfig/20220214-085546-marostegui.json [08:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:02] 10SRE-swift-storage, 10Observability-Logging, 10User-fgiunchedi: Missed swift log rotation can lead to full root filesystem - https://phabricator.wikimedia.org/T301657 (10fgiunchedi) [08:58:05] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10Damiendf) [08:58:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2008.codfw.wmnet with OS bullseye [08:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:17] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: add 00-load_modules.conf [puppet] - 10https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) (owner: 10Herron) [09:10:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P20642 and previous config saved to /var/cache/conftool/dbconfig/20220214-091050-marostegui.json [09:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:17] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: disable shard rebalancing while deleting indices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761989 (owner: 10Cwhite) [09:13:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:13:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:46] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:06] (03CR) 10Filippo Giunchedi: [C: 03+2] network: make slice_network_constants work outside of module [puppet] - 10https://gerrit.wikimedia.org/r/761888 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:14:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [09:14:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [09:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298554)', diff saved to https://phabricator.wikimedia.org/P20643 and previous config saved to /var/cache/conftool/dbconfig/20220214-091422-ladsgroup.json [09:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:28] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [09:15:40] (03CR) 10Filippo Giunchedi: [C: 03+2] role: reorder prometheus profile inclusion [puppet] - 10https://gerrit.wikimedia.org/r/761910 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:15:45] (03PS5) 10Filippo Giunchedi: role: reorder prometheus profile inclusion [puppet] - 10https://gerrit.wikimedia.org/r/761910 (https://phabricator.wikimedia.org/T291946) [09:17:55] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add probes for "pop services" [puppet] - 10https://gerrit.wikimedia.org/r/761889 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:20:18] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[09:21:02] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: split probes by network sphere and address family (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:24:54] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [09:25:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300382)', diff saved to https://phabricator.wikimedia.org/P20644 and previous config saved to /var/cache/conftool/dbconfig/20220214-092555-marostegui.json [09:25:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:25:57] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [09:25:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:02] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [09:26:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T300382)', diff saved to https://phabricator.wikimedia.org/P20645 and previous config saved to /var/cache/conftool/dbconfig/20220214-092602-marostegui.json [09:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:51] 
(03PS12) 104nn1l2: trwikisource: Enable ULS webfonts by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [09:27:03] (03PS13) 104nn1l2: trwikisource: Enable ULS webfonts by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [09:29:18] (03PS1) 10JMeybohm: Update vendor to latest wmf branch of cfssl [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/762402 (https://phabricator.wikimedia.org/T299906) [09:29:33] (03PS4) 104nn1l2: InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) [09:30:33] (03CR) 10Awight: [C: 04-1] Rely on the default value for $wgFileExporterTarget (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762392 (owner: 10Awight) [09:31:09] (03CR) 104nn1l2: "Per our discussion at IRC, I'm adding to the reviewers." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [09:33:50] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Update vendor to latest wmf branch of cfssl [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/762402 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [09:34:27] !log update haproxy to 2.4.12 on cp4026 - T290005 [09:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:32] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:36:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T300382)', diff saved to https://phabricator.wikimedia.org/P20646 and previous config saved to /var/cache/conftool/dbconfig/20220214-093621-marostegui.json [09:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:26] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [09:39:08] (03PS1) 10JMeybohm: Update cfssl-issuer to v0.2.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/762403 [09:40:40] (03PS2) 10JMeybohm: Update cfssl-issuer to v0.2.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/762403 [09:40:46] !log update haproxy to 2.4.12 on cp4032 - T290005 [09:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:51] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:41:31] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Update cfssl-issuer to v0.2.2 [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/762403 (owner: 10JMeybohm) [09:43:24] (03PS5) 104nn1l2: InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) [09:44:16] !log published image docker-registry.discovery.wmnet/cfssl-issuer:0.2.2-0 [09:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:43] (03PS6) 104nn1l2: InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) [09:50:31] (03PS1) 10JMeybohm: Sync cfssl-issuer app and chart versions to latest release [deployment-charts] - 10https://gerrit.wikimedia.org/r/762405 [09:51:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P20647 and previous config saved to /var/cache/conftool/dbconfig/20220214-095125-marostegui.json [09:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298554)', diff saved to https://phabricator.wikimedia.org/P20648 and previous config saved to /var/cache/conftool/dbconfig/20220214-095622-ladsgroup.json [09:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:27] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [09:59:43] (03PS1) 10Filippo Giunchedi: wmflib: do not quote probe icmp address [puppet] - 10https://gerrit.wikimedia.org/r/762406 (https://phabricator.wikimedia.org/T291946) [10:01:28] (03PS2) 10JMeybohm: Sync cfssl-issuer app and chart versions to latest release [deployment-charts] - 10https://gerrit.wikimedia.org/r/762405 [10:02:50] (03CR) 10Filippo Giunchedi: [C: 03+2] wmflib: do not quote probe icmp address [puppet] - 10https://gerrit.wikimedia.org/r/762406 
(https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:05:30] (03PS1) 10Filippo Giunchedi: role: move probe alerts to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/762407 (https://phabricator.wikimedia.org/T291946) [10:05:55] (03PS1) 10Filippo Giunchedi: sre: add probes alerts [alerts] - 10https://gerrit.wikimedia.org/r/762408 (https://phabricator.wikimedia.org/T291946) [10:06:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P20649 and previous config saved to /var/cache/conftool/dbconfig/20220214-100630-marostegui.json [10:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:47] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote db1183 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/762146 (https://phabricator.wikimedia.org/T301219) (owner: 10Marostegui) [10:06:54] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.123e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:07:52] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:19] (03CR) 10jerkins-bot: [V: 04-1] sre: add probes alerts [alerts] - 10https://gerrit.wikimedia.org/r/762408 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:08:24] (03PS1) 10JMeybohm: cfssl-issuer fix version in changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/762409 [10:08:39] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cfssl-issuer fix version in changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/762409 (owner: 10JMeybohm) [10:11:17] (03PS1) 10Elukey: 
profile::kubernetes::node: add grub settings for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762410 (https://phabricator.wikimedia.org/T301412) [10:11:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P20650 and previous config saved to /var/cache/conftool/dbconfig/20220214-101126-ladsgroup.json [10:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:55] (03CR) 10jerkins-bot: [V: 04-1] profile::kubernetes::node: add grub settings for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762410 (https://phabricator.wikimedia.org/T301412) (owner: 10Elukey) [10:12:00] uff [10:12:29] !log published image docker-registry.discovery.wmnet/cfssl-issuer:0.2.2-1 [10:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:34] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33770/console" [puppet] - 10https://gerrit.wikimedia.org/r/762410 (https://phabricator.wikimedia.org/T301412) (owner: 10Elukey) [10:12:40] (03PS1) 10Vgutierrez: cache::haproxy: Log X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762411 (https://phabricator.wikimedia.org/T290005) [10:12:50] (03PS2) 10Elukey: profile::kubernetes::node: add grub settings for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762410 (https://phabricator.wikimedia.org/T301412) [10:17:08] (03PS3) 10Elukey: profile::kubernetes::node: add grub settings for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762410 (https://phabricator.wikimedia.org/T301412) [10:18:00] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33771/console" [puppet] - 10https://gerrit.wikimedia.org/r/762410 (https://phabricator.wikimedia.org/T301412) (owner: 10Elukey) [10:21:35] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1179 (T300382)', diff saved to https://phabricator.wikimedia.org/P20652 and previous config saved to /var/cache/conftool/dbconfig/20220214-102135-marostegui.json [10:21:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:21:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:41] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [10:21:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T300382)', diff saved to https://phabricator.wikimedia.org/P20653 and previous config saved to /var/cache/conftool/dbconfig/20220214-102142-marostegui.json [10:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:30] (03PS2) 10Filippo Giunchedi: sre: add probes alerts [alerts] - 10https://gerrit.wikimedia.org/r/762408 (https://phabricator.wikimedia.org/T291946) [10:26:04] (03CR) 10Elukey: [V: 03+1] "Assumptions:" [puppet] - 10https://gerrit.wikimedia.org/r/762410 (https://phabricator.wikimedia.org/T301412) (owner: 10Elukey) [10:26:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P20654 and previous config saved to /var/cache/conftool/dbconfig/20220214-102631-ladsgroup.json [10:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:36] (03PS4) 10Elukey: profile::kubernetes::node: add grub settings for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762410 
(https://phabricator.wikimedia.org/T300744) [10:29:36] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add probes alerts [alerts] - 10https://gerrit.wikimedia.org/r/762408 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:29:42] (03CR) 10Filippo Giunchedi: [C: 03+2] role: move probe alerts to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/762407 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:30:07] (03PS1) 10Vgutierrez: mtail::cache_haproxy: Split TTFB bucket by X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762412 (https://phabricator.wikimedia.org/T290005) [10:31:42] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300382)', diff saved to https://phabricator.wikimedia.org/P20655 and previous config saved to /var/cache/conftool/dbconfig/20220214-103154-marostegui.json [10:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:00] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [10:32:02] (03CR) 10jerkins-bot: [V: 04-1] mtail::cache_haproxy: Split TTFB bucket by X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762412 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:32:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "`git show --patience --color-moved=dimmed-zebra` makes it easy to see that the whole block moved with no changes." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/761915 (https://phabricator.wikimedia.org/T301559) (owner: 10Matthias Mullie) [10:35:05] (03PS1) 10Filippo Giunchedi: hiera: expect 301 from text/upload/ncredir on http [puppet] - 10https://gerrit.wikimedia.org/r/762413 (https://phabricator.wikimedia.org/T291946) [10:36:12] (03CR) 10Filippo Giunchedi: [C: 03+2] hiera: expect 301 from text/upload/ncredir on http [puppet] - 10https://gerrit.wikimedia.org/r/762413 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:41:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298554)', diff saved to https://phabricator.wikimedia.org/P20656 and previous config saved to /var/cache/conftool/dbconfig/20220214-104136-ladsgroup.json [10:41:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:41:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:42] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [10:41:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298554)', diff saved to https://phabricator.wikimedia.org/P20657 and previous config saved to /var/cache/conftool/dbconfig/20220214-104143-ladsgroup.json [10:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:04] (03CR) 10JMeybohm: [C: 03+1] profile::kubernetes::node: add grub settings for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762410 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) 
[10:46:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P20658 and previous config saved to /var/cache/conftool/dbconfig/20220214-104659-marostegui.json [10:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:13] (03PS1) 10Filippo Giunchedi: wmflib: add 'expect_redirect' option to probes [puppet] - 10https://gerrit.wikimedia.org/r/762415 (https://phabricator.wikimedia.org/T291946) [10:53:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298554)', diff saved to https://phabricator.wikimedia.org/P20659 and previous config saved to /var/cache/conftool/dbconfig/20220214-105328-ladsgroup.json [10:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:34] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [10:54:34] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33772/console" [puppet] - 10https://gerrit.wikimedia.org/r/762415 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:55:36] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] wmflib: add 'expect_redirect' option to probes [puppet] - 10https://gerrit.wikimedia.org/r/762415 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:56:36] !log restart apache/FPM on mediawiki canaries to pick up expat security updates [10:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/762410 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [11:01:15] (03CR) 10Elukey: [C: 03+2] profile::kubernetes::node: add grub settings for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762410 
(https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [11:02:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P20660 and previous config saved to /var/cache/conftool/dbconfig/20220214-110203-marostegui.json [11:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P20661 and previous config saved to /var/cache/conftool/dbconfig/20220214-110833-ladsgroup.json [11:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:33] (03PS1) 10Filippo Giunchedi: hiera: ncredir's /_status doesn't redirect to https [puppet] - 10https://gerrit.wikimedia.org/r/762416 (https://phabricator.wikimedia.org/T291946) [11:14:13] (03CR) 10Filippo Giunchedi: [C: 03+2] hiera: ncredir's /_status doesn't redirect to https [puppet] - 10https://gerrit.wikimedia.org/r/762416 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:15:00] 10Puppet, 10Infrastructure-Foundations: update hiera order in production environment - https://phabricator.wikimedia.org/T301349 (10jbond) [11:15:59] (03PS2) 10Vgutierrez: mtail::cache_haproxy: Split TTFB bucket by X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762412 (https://phabricator.wikimedia.org/T290005) [11:17:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300382)', diff saved to https://phabricator.wikimedia.org/P20662 and previous config saved to /var/cache/conftool/dbconfig/20220214-111708-marostegui.json [11:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:14] T300382: Make ipblocks_restrictions.ir_value unsigned on wmf wikis - https://phabricator.wikimedia.org/T300382 [11:17:56] (03CR) 10jerkins-bot: [V: 04-1] mtail::cache_haproxy: Split 
TTFB bucket by X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762412 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:19:57] (03PS3) 10Vgutierrez: mtail::cache_haproxy: Split TTFB bucket by X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762412 (https://phabricator.wikimedia.org/T290005) [11:21:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "All the task changes look correct to me, and diffConfig reports no change in the effective configuration. I just noticed one (old) typo th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [11:21:49] (03CR) 10jerkins-bot: [V: 04-1] mtail::cache_haproxy: Split TTFB bucket by X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762412 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:23:04] (03CR) 10Volans: "LGTM, one nit on a log message and one question inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [11:23:32] (03CR) 10Giuseppe Lavagetto: k8s: add module (038 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [11:23:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P20663 and previous config saved to /var/cache/conftool/dbconfig/20220214-112337-ladsgroup.json [11:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:31] (03PS1) 10Filippo Giunchedi: hiera: fix librenms probe [puppet] - 10https://gerrit.wikimedia.org/r/762417 (https://phabricator.wikimedia.org/T291946) [11:26:31] (03PS3) 10KartikMistry: Enable SectionTranslation in Occitan and Luganda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761626 (https://phabricator.wikimedia.org/T301443) [11:27:14] (03CR) 
10Filippo Giunchedi: [C: 03+2] hiera: fix librenms probe [puppet] - 10https://gerrit.wikimedia.org/r/762417 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:27:23] (03PS2) 10Filippo Giunchedi: hiera: fix librenms probe [puppet] - 10https://gerrit.wikimedia.org/r/762417 (https://phabricator.wikimedia.org/T291946) [11:28:05] (03PS5) 10Giuseppe Lavagetto: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) [11:32:41] (03PS7) 104nn1l2: InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) [11:33:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/761698 (owner: 10Volans) [11:34:33] (03CR) 10jerkins-bot: [V: 04-1] k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [11:34:50] (03CR) 104nn1l2: InitialiseSettings: General cleanup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [11:35:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [11:35:49] (03CR) 10Jbond: [C: 03+2] hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [11:36:42] (03CR) 10Jbond: [C: 03+2] hiera: create script endpoint for exporting hiera data (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [11:36:44] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [11:38:16] (03PS8) 104nn1l2: 
InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) [11:38:24] (03PS4) 10Vgutierrez: mtail::cache_haproxy: Split TTFB bucket by X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762412 (https://phabricator.wikimedia.org/T290005) [11:38:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298554)', diff saved to https://phabricator.wikimedia.org/P20664 and previous config saved to /var/cache/conftool/dbconfig/20220214-113842-ladsgroup.json [11:38:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [11:38:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [11:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:49] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [11:38:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298554)', diff saved to https://phabricator.wikimedia.org/P20665 and previous config saved to /var/cache/conftool/dbconfig/20220214-113850-ladsgroup.json [11:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:06] (03CR) 10Volans: [C: 03+2] requests: fix timeout [software/pywmflib] - 10https://gerrit.wikimedia.org/r/761698 (owner: 10Volans) [11:41:39] (03Merged) 10jenkins-bot: requests: fix timeout [software/pywmflib] - 10https://gerrit.wikimedia.org/r/761698 (owner: 10Volans) [11:42:08] (03PS1) 10Hnowlan: changeprop-jobqueue: more replicas, less CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/762418 
(https://phabricator.wikimedia.org/T300914) [11:42:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1021.eqiad.wmnet [11:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:47] (03PS2) 10Vgutierrez: cache::haproxy: Log X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762411 (https://phabricator.wikimedia.org/T290005) [11:44:49] (03PS5) 10Vgutierrez: mtail::cache_haproxy: Split TTFB bucket by X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762412 (https://phabricator.wikimedia.org/T290005) [11:47:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1021.eqiad.wmnet [11:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:48:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:48:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [11:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:17] (03CR) 10Jbond: cookbook sre.puppet.netbox: Cookbook for syncing netbox puppet data (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [11:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling 
db1169 (T300662)', diff saved to https://phabricator.wikimedia.org/P20666 and previous config saved to /var/cache/conftool/dbconfig/20220214-114817-marostegui.json [11:48:19] (03PS8) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [11:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:22] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [11:49:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300662)', diff saved to https://phabricator.wikimedia.org/P20667 and previous config saved to /var/cache/conftool/dbconfig/20220214-114931-marostegui.json [11:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1021.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [11:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298554)', diff saved to https://phabricator.wikimedia.org/P20668 and previous config saved to /var/cache/conftool/dbconfig/20220214-115115-ladsgroup.json [11:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:20] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [11:51:36] (03PS1) 10Marostegui: change_pp_page_T300381.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762419 (https://phabricator.wikimedia.org/T300381) [11:51:47] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1009.eqiad.wmnet [11:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:53] !log 
jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1021.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [11:51:54] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [11:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P20669 and previous config saved to /var/cache/conftool/dbconfig/20220214-115250-marostegui.json [11:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:26] (03CR) 10Ayounsi: [C: 03+1] "LGTM once addressed Riccardo's comments!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [11:58:40] PROBLEM - SSH on wtp1027.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:59:10] (03PS2) 10Matthias Mullie: [WikibaseMediaInfo] Make synonyms profile the default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761915 (https://phabricator.wikimedia.org/T301559) [12:00:45] jouncebot: now [12:00:46] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [12:02:22] you changed the B&C windows? [12:02:25] (03CR) 10Elukey: [C: 03+1] Sync cfssl-issuer app and chart versions to latest release [deployment-charts] - 10https://gerrit.wikimedia.org/r/762405 (owner: 10JMeybohm) [12:02:43] nn1l2: yes, it was announced on wikitech-l, https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/BJCVJOS6VIALLQ6RYACKNFNJSLAL2NUP/ [12:02:58] (03CR) 10Ayounsi: [C: 03+1] "No strong preference on the parameter vs. no parameter." 
[puppet] - 10https://gerrit.wikimedia.org/r/761372 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [12:03:04] wish you could communicate better :( [12:03:41] how? wikitech-l is considered _the_ communication method for tech things like that [12:04:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P20670 and previous config saved to /var/cache/conftool/dbconfig/20220214-120436-marostegui.json [12:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:22] you could have written a banner (big notice) at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220214T0800 [12:05:36] I mean https://wikitech.wikimedia.org/wiki/Deployments [12:06:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P20671 and previous config saved to /var/cache/conftool/dbconfig/20220214-120619-ladsgroup.json [12:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:11] (03PS6) 10Giuseppe Lavagetto: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) [12:13:15] (03PS11) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [12:18:43] (03CR) 10Ayounsi: Add Wikidough's IPv6 anycast network in esams (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [12:19:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P20672 and previous config saved to /var/cache/conftool/dbconfig/20220214-121941-marostegui.json [12:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:26] !log jmm@cumin2002 
START - Cookbook sre.hosts.reboot-single for host ganeti1016.eqiad.wmnet [12:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:55] (03CR) 10Majavah: Add Wikidough's IPv6 anycast network in esams (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [12:21:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P20673 and previous config saved to /var/cache/conftool/dbconfig/20220214-122124-ladsgroup.json [12:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1016.eqiad.wmnet [12:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:07] (03CR) 10Volans: "LGTM, one minor error inline. Then ofc this has to wait the module in spicerack and the completion of the inline TODOs." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:28:52] 10SRE, 10Analytics, 10SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10ayounsi) [12:34:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300662)', diff saved to https://phabricator.wikimedia.org/P20674 and previous config saved to /var/cache/conftool/dbconfig/20220214-123446-marostegui.json [12:34:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:34:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:52] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [12:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [12:35:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [12:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T300662)', diff saved to https://phabricator.wikimedia.org/P20675 and previous config saved to /var/cache/conftool/dbconfig/20220214-123506-marostegui.json [12:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:21] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300662)', diff saved to https://phabricator.wikimedia.org/P20676 and previous config saved to /var/cache/conftool/dbconfig/20220214-123620-marostegui.json [12:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298554)', diff saved to https://phabricator.wikimedia.org/P20677 and previous config saved to /var/cache/conftool/dbconfig/20220214-123629-ladsgroup.json [12:36:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:36:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:34] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [12:36:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298554)', diff saved to https://phabricator.wikimedia.org/P20678 and previous config saved to /var/cache/conftool/dbconfig/20220214-123636-ladsgroup.json [12:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:52] I'm looking for a DBA to deploy the four lines in https://gerrit.wikimedia.org/r/c/operations/puppet/+/759357 . Followup https://gerrit.wikimedia.org/r/c/operations/puppet/+/759359 is already deployed. [12:41:26] hey andre, I think a DBA is more likely to notice your message in -data-persistence. [12:41:39] heh, I was wondering whether that or this channel. Thanks! 
[12:42:21] (03PS1) 10Elukey: profile::docker::engine: skip profile::base::memory_cgroup [puppet] - 10https://gerrit.wikimedia.org/r/762432 (https://phabricator.wikimedia.org/T300744) [12:42:30] 10SRE, 10SRE-Access-Requests: saisuman ssh production public keys reused for WMCS - https://phabricator.wikimedia.org/T300708 (10SCherukuwada) My new public key is here: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIEQxRR4Wt7HJsEYRiWrHGwBPK4L8xbFC6IDy4dOPFahk I confirm that this key pair is not being used for anythi... [12:44:21] (03CR) 10Marostegui: "This change has been deployed on the live DBs" [puppet] - 10https://gerrit.wikimedia.org/r/759357 (https://phabricator.wikimedia.org/T299403) (owner: 10Aklapper) [12:44:23] (03PS2) 10Elukey: profile::docker::engine: skip profile::base::memory_cgroup [puppet] - 10https://gerrit.wikimedia.org/r/762432 (https://phabricator.wikimedia.org/T300744) [12:44:26] (03CR) 10Marostegui: [C: 03+2] mariadb/phabricator: Add auth table to GRANTS for phstats [puppet] - 10https://gerrit.wikimedia.org/r/759357 (https://phabricator.wikimedia.org/T299403) (owner: 10Aklapper) [12:45:01] (03CR) 10jerkins-bot: [V: 04-1] profile::docker::engine: skip profile::base::memory_cgroup [puppet] - 10https://gerrit.wikimedia.org/r/762432 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [12:45:38] (03CR) 10Elukey: "Running these new settings on ml-serve2006, everything looks good afaics." 
[puppet] - 10https://gerrit.wikimedia.org/r/762432 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [12:46:16] (03PS3) 10Elukey: profile::docker::engine: skip profile::base::memory_cgroup [puppet] - 10https://gerrit.wikimedia.org/r/762432 (https://phabricator.wikimedia.org/T300744) [12:46:23] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:10] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33773/console" [puppet] - 10https://gerrit.wikimedia.org/r/762432 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [12:47:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1016.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [12:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:31] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:34] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1016.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [12:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Add tls port for cloud vps rabbitmq [homer/public] - 10https://gerrit.wikimedia.org/r/755478 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [12:48:56] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [12:51:25] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P20679 and previous config saved to /var/cache/conftool/dbconfig/20220214-125125-marostegui.json [12:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:52] !log merging https://gerrit.wikimedia.org/r/c/operations/homer/public/+/755478 to core routers [12:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/762432 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [13:00:24] (03PS2) 10Majavah: wmcs: stop accessing gu_hidden in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/760953 (https://phabricator.wikimedia.org/T289068) (owner: 10Zabe) [13:05:34] (03CR) 10Ladsgroup: [C: 03+1] change_pp_page_T300381.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762419 (https://phabricator.wikimedia.org/T300381) (owner: 10Marostegui) [13:06:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P20680 and previous config saved to /var/cache/conftool/dbconfig/20220214-130630-marostegui.json [13:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:45] (03CR) 10Marostegui: [V: 03+2 C: 03+2] change_pp_page_T300381.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762419 (https://phabricator.wikimedia.org/T300381) (owner: 10Marostegui) [13:16:09] RECOVERY - Ganeti memory on ganeti1022 is OK: OK Memory 89% used https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [13:19:31] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33775/console" [puppet] - 10https://gerrit.wikimedia.org/r/761637 (owner: 
10Majavah) [13:20:42] (03CR) 10Ayounsi: "Looks good overall, a few comments." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [13:20:58] (03PS1) 10Filippo Giunchedi: hieradata: fix missing probes configuration [puppet] - 10https://gerrit.wikimedia.org/r/762437 (https://phabricator.wikimedia.org/T291946) [13:21:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300662)', diff saved to https://phabricator.wikimedia.org/P20681 and previous config saved to /var/cache/conftool/dbconfig/20220214-132135-marostegui.json [13:21:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [13:21:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [13:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:41] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [13:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:21:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T300662)', diff saved to https://phabricator.wikimedia.org/P20682 and previous config saved to /var/cache/conftool/dbconfig/20220214-132155-marostegui.json [13:21:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300662)', diff saved to https://phabricator.wikimedia.org/P20683 and previous config saved to /var/cache/conftool/dbconfig/20220214-132310-marostegui.json [13:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:08] (03PS9) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [13:24:12] (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:25:06] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix missing probes configuration [puppet] - 10https://gerrit.wikimedia.org/r/762437 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:25:12] (03PS2) 10Filippo Giunchedi: hieradata: fix missing probes configuration [puppet] - 10https://gerrit.wikimedia.org/r/762437 (https://phabricator.wikimedia.org/T291946) [13:26:44] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (10MMandere) a:05RobH→03MMandere [13:27:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298554)', diff saved to https://phabricator.wikimedia.org/P20684 and previous config saved to /var/cache/conftool/dbconfig/20220214-132736-ladsgroup.json [13:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:44] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [13:33:26] !log mmandere@cumin1001 START - 
Cookbook sre.hosts.decommission for hosts cp4031.ulsfo.wmnet [13:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P20685 and previous config saved to /var/cache/conftool/dbconfig/20220214-133815-marostegui.json [13:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:42:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P20686 and previous config saved to /var/cache/conftool/dbconfig/20220214-134242-ladsgroup.json [13:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:24] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp4031.ulsfo.wmnet [13:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:33] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mmandere@cumin1001 for hosts: `cp4031.ulsfo.wmnet` - cp4031.ulsfo.wmnet (**PASS**) - Downtimed... 
[13:45:30] (JobUnavailable) firing: (4) Reduced availability for job cache_envoy in ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:45:46] (03CR) 10MMandere: [C: 03+2] Remove cp4031 from cluster data [puppet] - 10https://gerrit.wikimedia.org/r/761012 (https://phabricator.wikimedia.org/T301269) (owner: 10BBlack) [13:47:56] !log rolling restart of apache on logstash* to pick up expat security updates [13:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:30] (JobUnavailable) firing: (4) Reduced availability for job cache_envoy in ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:53:05] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (10MMandere) a:05MMandere→03RobH [13:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P20687 and previous config saved to /var/cache/conftool/dbconfig/20220214-135320-marostegui.json [13:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:42] !log Jenkins contint instances are going to be restarted soon [13:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P20688 and previous config saved to /var/cache/conftool/dbconfig/20220214-135746-ladsgroup.json [13:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like 
a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220214T1400). [14:00:05] cirno, matthiasmullie, and nn1l2: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] hi [14:00:11] o/ [14:00:15] hi [14:01:07] o/ [14:01:30] I can deploy :) [14:01:46] first week of the new deployment schedule, exciting stuff [14:02:02] Hello, I am on clinic duty this week. Requesting topic change to reflect the same :) [14:02:11] mmandere: I will do that [14:02:31] (03PS2) 10Lucas Werkmeister (WMDE): Upload logo for apiportalwiki in wmgCentralAuthLoginIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762119 (https://phabricator.wikimedia.org/T301636) (owner: 10Stang) [14:02:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Upload logo for apiportalwiki in wmgCentralAuthLoginIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762119 (https://phabricator.wikimedia.org/T301636) (owner: 10Stang) [14:02:43] let’s start with the logos, logos are nice [14:02:52] Thank you marostegui! [14:04:56] (03PS1) 10Jbond: P:mail::mx: Disable ldap authentication on mx1001 [puppet] - 10https://gerrit.wikimedia.org/r/762442 (https://phabricator.wikimedia.org/T244792) [14:05:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33776/console" [puppet] - 10https://gerrit.wikimedia.org/r/762442 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [14:05:44] Lucas_WMDE, this does not work, how can I collect log from the plugin [14:05:59] what do you mean? 
[14:07:30] (03Merged) 10jenkins-bot: Upload logo for apiportalwiki in wmgCentralAuthLoginIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762119 (https://phabricator.wikimedia.org/T301636) (owner: 10Stang) [14:08:03] oh, sorry at that time it is not merged yet [14:08:07] nvm :) [14:08:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300662)', diff saved to https://phabricator.wikimedia.org/P20689 and previous config saved to /var/cache/conftool/dbconfig/20220214-140824-marostegui.json [14:08:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:08:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:31] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [14:08:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T300662)', diff saved to https://phabricator.wikimedia.org/P20690 and previous config saved to /var/cache/conftool/dbconfig/20220214-140832-marostegui.json [14:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:36] syncing the file now… [14:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:49] and then it’ll probably need a cache purge [14:09:19] !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/sul/foundation-black.png: Config: [[gerrit:762119|Upload logo for apiportalwiki in wmgCentralAuthLoginIcon (T301636)]] (duration: 00m 49s) [14:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:23] T301636: Missing icons for apiportalwiki and wikimaniawiki - 
https://phabricator.wikimedia.org/T301636 [14:09:39] https://en.wikipedia.org/static/images/sul/foundation-black.png works on my end but let’s purge it anyways to be sure [14:09:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300662)', diff saved to https://phabricator.wikimedia.org/P20691 and previous config saved to /var/cache/conftool/dbconfig/20220214-140947-marostegui.json [14:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:11] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf '%s\n' 'https://en.wikipedia.org/static/images/sul/foundation-black.png' | mwscript purgeList.php # T301636 [14:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:32] (03PS3) 10Lucas Werkmeister (WMDE): Fix missing icons for apiportalwiki and wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762111 (https://phabricator.wikimedia.org/T301636) (owner: 10Stang) [14:10:53] (03PS1) 10Jbond: nodegen: Fix typo and change log level [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/762444 [14:11:07] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Fix missing icons for apiportalwiki and wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762111 (https://phabricator.wikimedia.org/T301636) (owner: 10Stang) [14:11:12] (03CR) 10Jbond: [C: 03+2] nodegen: Fix typo and change log level [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/762444 (owner: 10Jbond) [14:11:33] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1183 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/762146 (https://phabricator.wikimedia.org/T301219) (owner: 10Marostegui) [14:12:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1146:3312 (T298554)', diff saved to https://phabricator.wikimedia.org/P20692 and previous config saved to /var/cache/conftool/dbconfig/20220214-141251-ladsgroup.json [14:12:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [14:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [14:12:56] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [14:12:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:13:01] (03Merged) 10jenkins-bot: nodegen: Fix typo and change log level [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/762444 (owner: 10Jbond) [14:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298554)', diff saved to https://phabricator.wikimedia.org/P20693 and previous config saved to /var/cache/conftool/dbconfig/20220214-141304-ladsgroup.json [14:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:44] (03Merged) 10jenkins-bot: Fix missing icons for apiportalwiki and wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762111 
(https://phabricator.wikimedia.org/T301636) (owner: 10Stang) [14:13:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:13:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:24] cirno: the second logos change is on mwdebug1001, can you test it there? [14:15:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:15:03] works pretty fine for me [14:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:11] ok thanks [14:16:24] (03PS12) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [14:16:53] syncing [14:17:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/762442 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [14:17:12] (03PS3) 10Lucas Werkmeister (WMDE): [WikibaseMediaInfo] Make synonyms profile the default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761915 (https://phabricator.wikimedia.org/T301559) (owner: 10Matthias Mullie) [14:17:39] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:762111|Fix missing icons for apiportalwiki and wikimaniawiki (T301636)]] (duration: 00m 49s) [14:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:44] T301636: Missing icons for apiportalwiki and wikimaniawiki - https://phabricator.wikimedia.org/T301636 [14:19:16] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] [WikibaseMediaInfo] Make synonyms profile the default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761915 (https://phabricator.wikimedia.org/T301559) 
(owner: 10Matthias Mullie) [14:20:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:29] (03Merged) 10jenkins-bot: [WikibaseMediaInfo] Make synonyms profile the default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761915 (https://phabricator.wikimedia.org/T301559) (owner: 10Matthias Mullie) [14:21:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:21:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:10] (03PS1) 10Filippo Giunchedi: prometheus: per-service icmp prober module [puppet] - 10https://gerrit.wikimedia.org/r/762446 (https://phabricator.wikimedia.org/T291946) [14:22:12] (03PS1) 10Filippo Giunchedi: hieradata: fix labweb probe [puppet] - 10https://gerrit.wikimedia.org/r/762447 (https://phabricator.wikimedia.org/T291946) [14:22:31] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:22:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:38] matthiasmullie: the MediaInfo change is on mwdebug1001, can you test it? [14:23:44] checking [14:24:09] congratulations jnuche on upgrading the fleet of Jenkins :] [14:24:13] Lucas_WMDE: LGTM! [14:24:31] ok! 
[14:24:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P20694 and previous config saved to /var/cache/conftool/dbconfig/20220214-142452-marostegui.json [14:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:03] (03PS14) 10Lucas Werkmeister (WMDE): trwikisource: Enable ULS webfonts by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [14:25:23] 10SRE, 10SRE-Access-Requests: Requesting access to Superset/Turnilo for Kinneretgordon - https://phabricator.wikimedia.org/T301098 (10KinneretG) Hi, I just realized that was the issue. I used my contractor username and full-time employee password. So all is good now :) I moved from contractor to full-time so m... [14:25:38] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:761915|[WikibaseMediaInfo] Make synonyms profile the default (T301559)]] (duration: 00m 48s) [14:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:43] T301559: Enable mediawiki_synonyms profile by default - https://phabricator.wikimedia.org/T301559 [14:26:03] !log Jenkins upgrade complete [14:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:18] (03CR) 10JMeybohm: [C: 03+1] Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 (owner: 10Giuseppe Lavagetto) [14:26:19] \o/ [14:26:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] trwikisource: Enable ULS webfonts by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [14:26:41] Lucas_WMDE: thanks! 
[14:27:05] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:27:25] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33778/console" [puppet] - 10https://gerrit.wikimedia.org/r/762446 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:27:30] (03Merged) 10jenkins-bot: trwikisource: Enable ULS webfonts by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [14:27:35] !log installing Java 8/stretch security updates [14:27:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] nn1l2: the trwikisource change is on mwdebug1001, can you test it? 
[14:28:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:28:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:49] ok [14:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:22] LGTM [14:29:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 (owner: 10Giuseppe Lavagetto) [14:29:37] ok [14:30:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:16] (03PS9) 10Lucas Werkmeister (WMDE): InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [14:30:50] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:694315|trwikisource: Enable ULS webfonts by default (T283626)]] (duration: 00m 48s) [14:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:55] T283626: Enable ULS webfonts by default on trwikisource - https://phabricator.wikimedia.org/T283626 [14:31:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for rzl - https://phabricator.wikimedia.org/T301606 (10Ottomata) Approved [14:31:50] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [14:32:33] (03Merged) 10jenkins-bot: InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762140 (https://phabricator.wikimedia.org/T301647) 
(owner: 104nn1l2) [14:32:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10Ottomata) Approved. I believe since this is not a WMF employee, we'll need an expiry date to put on the account. It can always be extended later if needed. [14:32:57] nn1l2: cleanup is on mwdebug1001 [14:33:05] not much to test I guess, since it shouldn’t change anything ^^ [14:33:11] I’ll see that nothing obvious breaks at least [14:33:44] I think it's not testable,and you can go ahead sith syncing [14:33:53] (03Merged) 10jenkins-bot: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 (owner: 10Giuseppe Lavagetto) [14:33:58] *with [14:34:12] enwikisource editing still works at least [14:34:14] good enough for me [14:35:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:30] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:762140|InitialiseSettings: General cleanup (T301647)]] (should be a no-op) (duration: 00m 48s) [14:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:35] T301647: Clean up InitialiseSettings - https://phabricator.wikimedia.org/T301647 [14:36:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:36:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:57] !log UTC afternoon backport window done [14:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:30] (03CR) 
10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: per-service icmp prober module [puppet] - 10https://gerrit.wikimedia.org/r/762446 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:37:35] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix labweb probe [puppet] - 10https://gerrit.wikimedia.org/r/762447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:37:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:49] Lucas_WMDE: Was the size of my InitialiseSettings clean up patch appropriate? I mean not too short, not too long? If not, I can make future patches shorter or longer. [14:39:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P20695 and previous config saved to /var/cache/conftool/dbconfig/20220214-143956-marostegui.json [14:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:11] I think I would prefer them a bit shorter [14:40:16] but it was still okay [14:40:30] (03CR) 10Cathal Mooney: "Thanks for the feedback, some replies in-line." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:40:50] okay [14:41:20] 10SRE, 10Observability-Logging, 10User-fgiunchedi: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110 (10faidon) My (perhaps dated or incorrect) understanding is that: 1. We currently have no RBAC in Logstash; 1. Everyone in the "NDA" group have access to all data stored... [14:41:46] (03CR) 10Cathal Mooney: "Some replies inline.
Will upload new patchset for the log message, Was gonna leave the other one as is for now but if you feel its import" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [14:42:30] See you tomorrow :) Should I add you to the reviewers before the B&C window? Or you can do the reviewing task in the B&C window itself? [14:42:57] it’s better to do it beforehand [14:43:00] since it takes some time [14:43:05] so feel free to keep adding me [14:43:14] Thanks! :) [14:46:30] (03PS1) 10Vgutierrez: cache::haproxy: Limit HAProxy to the same NUMA node as the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/762449 (https://phabricator.wikimedia.org/T290005) [14:47:40] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33779/console" [puppet] - 10https://gerrit.wikimedia.org/r/762449 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:51:33] !log disable puppet on cp nodes running HAProxy - T290005 [14:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:39] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:52:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298554)', diff saved to https://phabricator.wikimedia.org/P20696 and previous config saved to /var/cache/conftool/dbconfig/20220214-145257-ladsgroup.json [14:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:09] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [14:54:01] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Limit HAProxy to the same NUMA node as the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/762449 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) 
[14:54:05] (03PS3) 10JMeybohm: Sync cfssl-issuer app and chart versions to latest release [deployment-charts] - 10https://gerrit.wikimedia.org/r/762405 [14:55:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300662)', diff saved to https://phabricator.wikimedia.org/P20697 and previous config saved to /var/cache/conftool/dbconfig/20220214-145501-marostegui.json [14:55:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [14:55:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [14:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:07] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [14:55:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T300662)', diff saved to https://phabricator.wikimedia.org/P20698 and previous config saved to /var/cache/conftool/dbconfig/20220214-145508-marostegui.json [14:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300662)', diff saved to https://phabricator.wikimedia.org/P20699 and previous config saved to /var/cache/conftool/dbconfig/20220214-145625-marostegui.json [14:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:29] RECOVERY - SSH on wtp1027.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:04:33] (03PS1) 10Filippo Giunchedi: hieradata: swap prometheus200[46] [puppet] - 
10https://gerrit.wikimedia.org/r/762453 (https://phabricator.wikimedia.org/T296199) [15:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P20700 and previous config saved to /var/cache/conftool/dbconfig/20220214-150801-ladsgroup.json [15:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P20701 and previous config saved to /var/cache/conftool/dbconfig/20220214-151130-marostegui.json [15:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:07] 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10Tom_Magerlein) [15:13:15] 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10Tom_Magerlein) [15:17:36] 10SRE-Access-Requests: Requesting access to analytics-privatedata for Kiron Lebeck (klebeck-tmlt) - https://phabricator.wikimedia.org/T301680 (10Klebeck-tmlt) [15:19:16] 10SRE-Access-Requests: Requesting access to analytics-privatedata for Kiron Lebeck (klebeck-tmlt) - https://phabricator.wikimedia.org/T301680 (10Klebeck-tmlt) [15:19:29] (03CR) 10Giuseppe Lavagetto: prometheus-exporters/statsd: Run with --log.level=warn (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/758868 (https://phabricator.wikimedia.org/T300629) (owner: 10JMeybohm) [15:23:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P20703 and previous config saved to /var/cache/conftool/dbconfig/20220214-152306-ladsgroup.json [15:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:22] (03CR) 
10JMeybohm: prometheus-exporters/statsd: Run with --log.level=warn (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/758868 (https://phabricator.wikimedia.org/T300629) (owner: 10JMeybohm) [15:26:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P20704 and previous config saved to /var/cache/conftool/dbconfig/20220214-152635-marostegui.json [15:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:04] (03CR) 10Giuseppe Lavagetto: [C: 03+1] prometheus-exporters/statsd: Run with --log.level=warn [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/758868 (https://phabricator.wikimedia.org/T300629) (owner: 10JMeybohm) [15:32:03] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/762432 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:35:38] (03PS3) 10Ssingh: wikimedia-dns.org: add AAAA records for Wikidough [dns] - 10https://gerrit.wikimedia.org/r/761363 (https://phabricator.wikimedia.org/T301165) [15:35:53] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] prometheus-exporters/statsd: Run with --log.level=warn [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/758868 (https://phabricator.wikimedia.org/T300629) (owner: 10JMeybohm) [15:36:08] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus-exporters/statsd: Run with --log.level=warn [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/758868 (https://phabricator.wikimedia.org/T300629) (owner: 10JMeybohm) [15:37:32] (03PS1) 10JMeybohm: Fix distribution in prometheus-statsd-exporter changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/762460 (https://phabricator.wikimedia.org/T300629) [15:37:42] !log published image docker-registry.discovery.wmnet/prometheus-statsd-exporter:0.0.10 [15:37:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::docker::engine: skip profile::base::memory_cgroup [puppet] - 10https://gerrit.wikimedia.org/r/762432 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:38:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298554)', diff saved to https://phabricator.wikimedia.org/P20705 and previous config saved to /var/cache/conftool/dbconfig/20220214-153811-ladsgroup.json [15:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:16] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [15:38:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [15:38:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [15:38:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [15:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [15:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:41] (03PS1) 10Vgutierrez: haproxy::tls_terminator: Fix cpu-map [puppet] - 10https://gerrit.wikimedia.org/r/762462 (https://phabricator.wikimedia.org/T290005) [15:39:26] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix distribution in prometheus-statsd-exporter changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/762460 
(https://phabricator.wikimedia.org/T300629) (owner: 10JMeybohm) [15:39:57] 10SRE-Access-Requests: Change Ai-Jou Chou's account to WMF employee - https://phabricator.wikimedia.org/T301681 (10elukey) [15:39:59] (03PS1) 10JMeybohm: Update default prometheus-statsd-exporter version to 0.0.10 [puppet] - 10https://gerrit.wikimedia.org/r/762463 (https://phabricator.wikimedia.org/T300629) [15:40:45] (03CR) 10Ssingh: "(rebased; no code change)" [dns] - 10https://gerrit.wikimedia.org/r/761363 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [15:41:01] (03CR) 10Vgutierrez: [C: 03+2] haproxy::tls_terminator: Fix cpu-map [puppet] - 10https://gerrit.wikimedia.org/r/762462 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:41:39] (03CR) 10Ssingh: [C: 03+2] wikimedia-dns.org: add AAAA records for Wikidough [dns] - 10https://gerrit.wikimedia.org/r/761363 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [15:41:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300662)', diff saved to https://phabricator.wikimedia.org/P20706 and previous config saved to /var/cache/conftool/dbconfig/20220214-154139-marostegui.json [15:41:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [15:41:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [15:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:45] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [15:41:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T300662)', diff saved to https://phabricator.wikimedia.org/P20707 and previous config saved to /var/cache/conftool/dbconfig/20220214-154147-marostegui.json [15:41:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:18] !log running authdns-update for T301165 [15:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:23] T301165: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 [15:44:16] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::docker::engine: skip profile::base::memory_cgroup [puppet] - 10https://gerrit.wikimedia.org/r/762432 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:45:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300662)', diff saved to https://phabricator.wikimedia.org/P20708 and previous config saved to /var/cache/conftool/dbconfig/20220214-154502-marostegui.json [15:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:07] PROBLEM - puppet last run on logstash1025 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:45:47] PROBLEM - puppet last run on logstash1024 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:45:58] !log re-enable puppet on cp nodes running HAProxy - T290005 [15:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:02] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:48:04] (03PS3) 10Ssingh: bird: update vips_filter for Wikidough's IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/761372 (https://phabricator.wikimedia.org/T301165) [15:48:51] PROBLEM - Host check.wikimedia-dns.org is DOWN: /bin/ping -6 -n -U -w 10 -c 2 check.wikimedia-dns.org [15:49:38] hmm interesting [15:50:18] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:50:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:31] RECOVERY - puppet last run on logstash1024 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:50:31] RECOVERY - puppet last run on logstash1025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:53:59] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for rzl - https://phabricator.wikimedia.org/T301606 (10akosiaris) Approved [15:55:17] RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 1.54 ms [15:55:25] (03PS1) 10Elukey: admin: move user aikochou to full employee [puppet] - 10https://gerrit.wikimedia.org/r/762468 (https://phabricator.wikimedia.org/T301681) [15:56:22] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Change Ai-Jou Chou's account to WMF employee - https://phabricator.wikimedia.org/T301681 (10elukey) [15:56:37] (03CR) 10Ssingh: [C: 03+2] bird: update vips_filter for Wikidough's IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/761372 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [16:00:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P20709 and previous config saved to /var/cache/conftool/dbconfig/20220214-160007-marostegui.json [16:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:35] 10SRE, 10SRE-Access-Requests, 
10Patch-For-Review: Change Ai-Jou Chou's account to WMF employee - https://phabricator.wikimedia.org/T301681 (10calbon) This is for a new hire on my team. [16:01:53] (03CR) 10Herron: prometheus: split probes by network sphere and address family (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:01:57] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2021/2022-Q3): Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10lmata) [16:02:40] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/762442 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [16:04:03] PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100% [16:04:46] 10SRE, 10vm-requests: eqiad: 3 VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10razzi) Yep @BTullis I'll get right to it. 
[16:07:06] !log update mx1001 to disable ldap validation of gmail emails gerrit:762442 (already on mx2001) [16:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:12] (03PS1) 10Ladsgroup: Add tox.ini [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762469 [16:07:19] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:mail::mx: Disable ldap authentication on mx1001 [puppet] - 10https://gerrit.wikimedia.org/r/762442 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [16:07:38] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add tox.ini [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762469 (owner: 10Ladsgroup) [16:08:10] sukhe: FYI ^^^ (check.wikimedia-dns.org) [16:08:18] volans: yep [16:08:20] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm for new host datahubsearch1002.eqiad.wmnet [16:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:26] (03PS1) 10Majavah: toolforge cron runner: better support for secondary nodes [puppet] - 10https://gerrit.wikimedia.org/r/762470 [16:10:13] ping -6 is failing, which is OK since we don't actually have v6 on the anycast yet but I still need to silence it [16:10:50] why are we serving AAAA records if the service does not support IPv6 yet? [16:11:35] we can stop serving AAAA records but I wanted to publish them to test other stuff, like https://gerrit.wikimedia.org/r/c/operations/puppet/+/761362 [16:14:04] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10RobH) They closed the old ticket due to delays in the router's shipment arrival, so opened a new one today as the routers are now at drmr... [16:14:26] 10SRE, 10Observability-Logging, 10User-fgiunchedi: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110 (10fgiunchedi) >>!
In T301110#7707705, @faidon wrote: > My (perhaps dated or incorrect) understanding is that: > 1. We currently have no RBAC in Logstash; > 1. Everyone... [16:15:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P20710 and previous config saved to /var/cache/conftool/dbconfig/20220214-161511-marostegui.json [16:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:44] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10cmooney) @MatthewVernon thanks for the feedback. I'll know by mid-week if we are on target. Should be fine to have everything ready by next week for you.... [16:17:09] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:13] 10SRE, 10Discovery: Test network optimizations in RELForge - https://phabricator.wikimedia.org/T301683 (10bking) [16:20:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/762468 (https://phabricator.wikimedia.org/T301681) (owner: 10Elukey) [16:21:24] (03PS1) 10Ladsgroup: [DNM] Test CI [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762474 [16:21:49] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: swap prometheus200[46] [puppet] - 10https://gerrit.wikimedia.org/r/762453 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [16:24:00] (03CR) 10Klausman: [C: 03+1] admin: move user aikochou to full employee [puppet] - 10https://gerrit.wikimedia.org/r/762468 (https://phabricator.wikimedia.org/T301681) (owner: 10Elukey) [16:26:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:26:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:30] (03Abandoned) 10Ladsgroup: [DNM] Test CI [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762474 (owner: 10Ladsgroup) [16:30:05] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220214T1630). 
[16:30:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300662)', diff saved to https://phabricator.wikimedia.org/P20711 and previous config saved to /var/cache/conftool/dbconfig/20220214-163016-marostegui.json [16:30:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:30:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:22] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [16:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [16:30:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [16:30:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [16:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [16:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [16:31:04] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [16:31:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T300662)', diff saved to https://phabricator.wikimedia.org/P20712 and previous config saved to /var/cache/conftool/dbconfig/20220214-163113-marostegui.json [16:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300662)', diff saved to https://phabricator.wikimedia.org/P20713 and previous config saved to /var/cache/conftool/dbconfig/20220214-163228-marostegui.json [16:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:08] 10SRE, 10vm-requests: eqiad: 3 VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10MoritzMuehlenhoff) Please use row A, B and C, row D is currently disproportionately full compared to the others. 
[16:33:23] (03PS1) 10MSantos: maps: don't load maps services in maps master [puppet] - 10https://gerrit.wikimedia.org/r/762477 [16:35:30] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:40:15] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33781/console" [puppet] - 10https://gerrit.wikimedia.org/r/762470 (owner: 10Majavah) [16:40:53] !log razzi@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host datahubsearch1002.eqiad.wmnet [16:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:30] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:46:40] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762480 (https://phabricator.wikimedia.org/T128546) [16:47:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P20714 and previous config saved to /var/cache/conftool/dbconfig/20220214-164733-marostegui.json [16:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:30] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:51:13] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762480 
(https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:51:57] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762480 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:54:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10Htriedman) I believe the expiry should be roughly 6 months from now — let's say (for the moment, at least) 31 August 2022. [16:54:08] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:762480| Bumping portals to master (T128546)]] (duration: 00m 50s) [16:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:12] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:54:25] (03PS1) 10RLazarus: admin: Add rzl to analytics-privatedata-users, add kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/762481 (https://phabricator.wikimedia.org/T301606) [16:54:57] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:762480| Bumping portals to master (T128546)]] (duration: 00m 49s) [16:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:15] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10RhinosF1) The expiry date will be whenever the contract ends. You should check this with whoever signed it. 
[16:55:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:55:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:30] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:56:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:01:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P20715 and previous config saved to /var/cache/conftool/dbconfig/20220214-170238-marostegui.json [17:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:01] (03CR) 10Elukey: [C: 03+2] admin: move user aikochou to full employee [puppet] - 10https://gerrit.wikimedia.org/r/762468 (https://phabricator.wikimedia.org/T301681) (owner: 10Elukey) [17:09:37] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - 
https://phabricator.wikimedia.org/T300277 (10RobH) Arzhel reminded me in our sync up meeting about the too short cable for mr1, so we've appended this to the ticket: Support, > I... [17:13:18] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-node-exporter.service,wmf_auto_restart_prometheus-swagger-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:30] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) [17:17:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300662)', diff saved to https://phabricator.wikimedia.org/P20717 and previous config saved to /var/cache/conftool/dbconfig/20220214-171743-marostegui.json [17:17:44] ACKNOWLEDGEMENT - ElasticSearch setting check - 9400 on elastic1057 is CRITICAL: CRITICAL - .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511#7708316 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:17:44] ACKNOWLEDGEMENT - ElasticSearch setting check - 9400 on elastic1068 is CRITICAL: CRITICAL - .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511#7708316 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:17:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [17:17:44] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic1073 is CRITICAL: CRITICAL - .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511#7708316 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:17:44] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic1075 is CRITICAL: CRITICAL - .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511#7708316 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:17:44] ACKNOWLEDGEMENT - ElasticSearch 
setting check - 9400 on elastic1076 is CRITICAL: CRITICAL - .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511#7708316 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:17:45] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic1083 is CRITICAL: CRITICAL - .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511#7708316 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:17:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [17:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:49] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [17:17:50] (03CR) 10JMeybohm: [C: 03+1] k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [17:17:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T300662)', diff saved to https://phabricator.wikimedia.org/P20718 and previous config saved to /var/cache/conftool/dbconfig/20220214-171750-marostegui.json [17:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:21] (03CR) 10Hashar: "Thx Ahmon, I will follow up tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [17:19:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300662)', diff saved to https://phabricator.wikimedia.org/P20719 and previous config saved to /var/cache/conftool/dbconfig/20220214-171905-marostegui.json [17:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:13] (03PS1) 10Jbond: prepare: fix subprocess calls [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/762487 [17:25:59] (03PS7) 10Cathal Mooney: Base config additions and updated templates to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) [17:27:20] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10Htriedman) @RhinosF1 just checked, the expiry date is 13 September 2022 [17:27:41] (03PS2) 10Jbond: prepare: fix subprocess calls [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/762487 [17:27:43] (03PS1) 10Jbond: 2.1.1: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/762488 [17:28:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10Htriedman) Hi SRE, just want to drop in here that this task should have the same comments as [[ https://phabricator.wikimedia.org/T301581 | T301581 ]]: - @JBennett is appro... [17:28:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Kiron Lebeck (klebeck-tmlt) - https://phabricator.wikimedia.org/T301680 (10Htriedman) Hi SRE, just want to drop in here that this task should have the same comments as [[ https://phabricator.wikimedia.org/T301581 | T301581 ]]: - @JBen... 
[17:28:54] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10Htriedman) Hi SRE, just want to drop in here that this task should have the same comments as [[ https://phabricator.wikimedia.org/T301581 | T301581 ]]: - @JBennett is approv... [17:29:18] (03CR) 10Jbond: [C: 03+2] prepare: fix subprocess calls [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/762487 (owner: 10Jbond) [17:29:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (hw issue)', diff saved to https://phabricator.wikimedia.org/P20720 and previous config saved to /var/cache/conftool/dbconfig/20220214-172924-ladsgroup.json [17:29:24] (03CR) 10Jbond: [C: 03+2] 2.1.1: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/762488 (owner: 10Jbond) [17:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:44] (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/762489 [17:30:44] (03Merged) 10jenkins-bot: prepare: fix subprocess calls [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/762487 (owner: 10Jbond) [17:31:08] (03Merged) 10jenkins-bot: 2.1.1: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/762488 (owner: 10Jbond) [17:31:30] 10SRE, 10SRE-Access-Requests: Change Ai-Jou Chou's account to WMF employee - https://phabricator.wikimedia.org/T301681 (10elukey) [17:31:44] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/762489 (owner: 10Jbond) [17:32:04] 10SRE, 10SRE-Access-Requests: Change Ai-Jou Chou's account to WMF employee - https://phabricator.wikimedia.org/T301681 (10elukey) 05Open→03Resolved a:03elukey As far as I can see there is not much more metadata to change, I'll report in this task if I find more. 
[17:32:58] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests, 10Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16) - https://phabricator.wikimedia.org/T300568 (10Dzahn) @Volans Yes, it has been fixed by making etherpad listen on "::" with https://gerrit.wikimedia.org/r/c/o... [17:34:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P20721 and previous config saved to /var/cache/conftool/dbconfig/20220214-173410-marostegui.json [17:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:18] PROBLEM - puppet last run on wcqs2002 is CRITICAL: CRITICAL: Puppet has been disabled for 604854 seconds, message: test - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:34:58] (03PS1) 10Eigyan: wmf-config: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762490 (https://phabricator.wikimedia.org/T297629) [17:35:02] PROBLEM - puppet last run on wcqs2001 is CRITICAL: CRITICAL: Puppet has been disabled for 604898 seconds, message: test - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:35:02] PROBLEM - puppet last run on wcqs2003 is CRITICAL: CRITICAL: Puppet has been disabled for 604898 seconds, message: test - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:35:08] (03CR) 10Dzahn: [C: 03+1] admin: Add rzl to analytics-privatedata-users, add kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/762481 (https://phabricator.wikimedia.org/T301606) (owner: 10RLazarus) [17:35:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:35:22] !log ladsgroup@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298554)', diff saved to https://phabricator.wikimedia.org/P20722 and previous config saved to /var/cache/conftool/dbconfig/20220214-173526-ladsgroup.json [17:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:31] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [17:35:34] (03PS2) 10Eigyan: wmf-config: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762490 (https://phabricator.wikimedia.org/T297629) [17:35:45] (03PS2) 10RLazarus: admin: Add rzl to analytics-privatedata-users, add kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/762481 (https://phabricator.wikimedia.org/T301606) [17:36:38] (03PS1) 10Elukey: Add ml-serve200[7,8] to the k8s ml-serve-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/762491 (https://phabricator.wikimedia.org/T300744) [17:37:45] (03CR) 10Volans: [C: 04-1] "LGTM, voting -1 just because this has to wait for https://phabricator.wikimedia.org/T276589#7420124 to be solved before it can be merged." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [17:39:01] (03PS4) 10Dzahn: DHCP: remove etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/761661 [17:39:52] (03CR) 10Dzahn: [C: 03+2] DHCP: remove etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/761661 (owner: 10Dzahn) [17:40:13] (03PS2) 10Elukey: Add ml-serve200[7,8] to the k8s ml-serve-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/762491 (https://phabricator.wikimedia.org/T300744) [17:40:19] (03PS3) 10Eigyan: wmf-config: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762490 (https://phabricator.wikimedia.org/T297629) [17:40:40] (03PS4) 10Eigyan: wmf-config: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762490 (https://phabricator.wikimedia.org/T297629) [17:41:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance - hw issues [17:41:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance - hw issues [17:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:40] (03PS1) 10Jelto: gitlab: rename test instance, use letsencrypt certs [puppet] - 10https://gerrit.wikimedia.org/r/762495 (https://phabricator.wikimedia.org/T297411) [17:45:50] (03CR) 10Dzahn: [C: 03+1] "Yes, we talked about this in our meeting today. 
Also it makes sense that certs are in the new place with certbot, ACK" [puppet] - 10https://gerrit.wikimedia.org/r/762495 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [17:47:11] (03PS1) 10Elukey: Add ml-serve200[7,8] to the k8s ml-serve-codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/762496 (https://phabricator.wikimedia.org/T300744) [17:48:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts etherpad1002.eqiad.wmnet [17:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P20724 and previous config saved to /var/cache/conftool/dbconfig/20220214-174915-marostegui.json [17:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:44] (03CR) 10RLazarus: [C: 03+2] admin: Add rzl to analytics-privatedata-users, add kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/762481 (https://phabricator.wikimedia.org/T301606) (owner: 10RLazarus) [17:55:50] (03CR) 10Dduvall: contint: Install docker 20.10 from thirdparty/ci on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [17:56:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for rzl - https://phabricator.wikimedia.org/T301606 (10RLazarus) [17:56:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for rzl - https://phabricator.wikimedia.org/T301606 (10RLazarus) 05Open→03Resolved a:03RLazarus Done, thanks! 
[17:58:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts etherpad1002.eqiad.wmnet [17:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:toolforge: remove clush access [puppet] - 10https://gerrit.wikimedia.org/r/761337 (https://phabricator.wikimedia.org/T298191) (owner: 10Majavah) [17:58:17] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests, 10Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16) - https://phabricator.wikimedia.org/T300568 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `etherpad1002.eqiad.w... [17:59:10] (03PS3) 10Majavah: P:openstack::galera: drop puppetmaster firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/760643 [17:59:57] 10SRE, 10Wikimedia-Etherpad, 10serviceops: vm request for etherpad1002 - https://phabricator.wikimedia.org/T243475 (10Dzahn) decom'ed today as part of T300568 [18:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220214T1800). [18:03:13] (03CR) 10Dzahn: [C: 03+1] "Looking at this again, do we have to still add new rules to allow monitoring and localhost?" 
[puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [18:04:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300662)', diff saved to https://phabricator.wikimedia.org/P20725 and previous config saved to /var/cache/conftool/dbconfig/20220214-180419-marostegui.json [18:04:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [18:04:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [18:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:25] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [18:04:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T300662)', diff saved to https://phabricator.wikimedia.org/P20726 and previous config saved to /var/cache/conftool/dbconfig/20220214-180427-marostegui.json [18:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300662)', diff saved to https://phabricator.wikimedia.org/P20727 and previous config saved to /var/cache/conftool/dbconfig/20220214-180541-marostegui.json [18:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:12] (03CR) 10Dzahn: "Hmm.. so the docker class does not really do anything besides requiring docker::configuration and the package install. 
And you already hav" [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:15:02] 10ops-codfw, 10DBA: codfw: db2136: Correctable memory error rate exceeded for DIMM_B5 - https://phabricator.wikimedia.org/T301713 (10Papaul) [18:17:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298554)', diff saved to https://phabricator.wikimedia.org/P20728 and previous config saved to /var/cache/conftool/dbconfig/20220214-181714-ladsgroup.json [18:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:20] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [18:19:58] PROBLEM - Host db2136.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:20:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P20729 and previous config saved to /var/cache/conftool/dbconfig/20220214-182046-marostegui.json [18:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:55] RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:35] (03CR) 10Dduvall: contint: Install docker 20.10 from thirdparty/ci on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:25:01] RECOVERY - Host db2136.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.87 ms [18:30:23] (03PS2) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for buster [puppet] - 10https://gerrit.wikimedia.org/r/758986 (https://phabricator.wikimedia.org/T300682) [18:30:25] (03PS3) 10Dduvall: contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 
(https://phabricator.wikimedia.org/T300682) [18:31:37] (03CR) 10jerkins-bot: [V: 04-1] contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:32:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P20730 and previous config saved to /var/cache/conftool/dbconfig/20220214-183218-ladsgroup.json [18:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:24] (03PS4) 10Dduvall: contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) [18:35:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P20731 and previous config saved to /var/cache/conftool/dbconfig/20220214-183551-marostegui.json [18:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:39] (03CR) 10Dduvall: contint: Install docker 20.10 from thirdparty/ci on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:40:17] (03PS5) 10Dduvall: contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) [18:42:00] (03CR) 10Dduvall: contint: Install docker 20.10 from thirdparty/ci on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:42:31] (03PS19) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [18:44:43] !log contint2001 - disabling puppet, try replacing docker version 
(docker-io -> docker-ce), contint1001 first which is currently NOT the active server - gerrit:758987 T300682 [18:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:49] T300682: contint1001 and contint2001 need a newer version of Docker installed - https://phabricator.wikimedia.org/T300682 [18:45:00] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.0.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/762508 [18:45:16] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.0.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/762508 (owner: 10Volans) [18:45:20] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33785/contint1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:46:56] (03PS1) 10Andrew Bogott: dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) [18:47:02] (03CR) 10Dzahn: [C: 03+2] contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:47:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P20732 and previous config saved to /var/cache/conftool/dbconfig/20220214-184723-ladsgroup.json [18:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:34] (03CR) 10Dzahn: [C: 03+1] "ah, wait, I see the dependency is not merged yet. 
ACK" [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [18:47:36] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.0.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/762508 (owner: 10Volans) [18:47:38] (03CR) 10jerkins-bot: [V: 04-1] dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) (owner: 10Andrew Bogott) [18:48:35] PROBLEM - Host db2136.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:49:00] (03PS20) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [18:49:10] (03CR) 10Jbond: "Ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [18:50:11] RECOVERY - Host db2136.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [18:50:36] Is that supposed to be flapping [18:50:46] Amir1: ^ [18:50:51] (03PS1) 10Volans: Upstream release v1.0.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/762510 [18:50:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300662)', diff saved to https://phabricator.wikimedia.org/P20733 and previous config saved to /var/cache/conftool/dbconfig/20220214-185056-marostegui.json [18:50:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [18:50:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [18:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:01] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [18:51:04] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T300662)', diff saved to https://phabricator.wikimedia.org/P20734 and previous config saved to /var/cache/conftool/dbconfig/20220214-185103-marostegui.json [18:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:07] (03CR) 10Volans: [C: 03+2] Upstream release v1.0.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/762510 (owner: 10Volans) [18:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300662)', diff saved to https://phabricator.wikimedia.org/P20735 and previous config saved to /var/cache/conftool/dbconfig/20220214-185218-marostegui.json [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:20] (03PS2) 10Andrew Bogott: dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) [18:54:01] (03CR) 10jerkins-bot: [V: 04-1] dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) (owner: 10Andrew Bogott) [18:54:03] (03Merged) 10jenkins-bot: Upstream release v1.0.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/762510 (owner: 10Volans) [18:57:18] (03CR) 10Scardenasmolinar: [C: 03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/762490 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [18:58:58] 10SRE, 10ops-codfw, 10DBA: codfw: db2136: Correctable memory error rate exceeded for DIMM_B5 - https://phabricator.wikimedia.org/T301713 (10Papaul) 05Open→03Resolved swapped B5 with A5 and did IDRAC and BIOS upgrade Firmware before ` BIOS Version 2.5.4 iDRAC Firmware Version 4.10.10.10 ` firmware a... [18:59:29] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [19:01:48] !log uploaded python3-wmflib_1.0.2 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [19:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298554)', diff saved to https://phabricator.wikimedia.org/P20736 and previous config saved to /var/cache/conftool/dbconfig/20220214-190228-ladsgroup.json [19:02:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [19:02:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [19:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:33] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [19:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298554)', diff saved to https://phabricator.wikimedia.org/P20737 and previous config saved to /var/cache/conftool/dbconfig/20220214-190235-ladsgroup.json [19:02:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:45] (03PS1) 10Ottomata: 2020.02~wmf7 - Improve env vars setting during activate and deactivate [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/762514 (https://phabricator.wikimedia.org/T292699) [19:05:43] PROBLEM - SSH on wtp1027.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:07:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P20738 and previous config saved to /var/cache/conftool/dbconfig/20220214-190722-marostegui.json [19:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:03] (03CR) 10Ottomata: [C: 03+2] Actually unset env vars that are activated by conda/activate.d/env_vars.sh [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/758983 (https://phabricator.wikimedia.org/T292699) (owner: 10Ottomata) [19:08:05] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Actually unset env vars that are activated by conda/activate.d/env_vars.sh [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/758983 (https://phabricator.wikimedia.org/T292699) (owner: 10Ottomata) [19:08:50] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:31] (03CR) 10Ottomata: [C: 03+2] 2020.02~wmf7 - Improve env vars setting during activate and deactivate [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/762514 (https://phabricator.wikimedia.org/T292699) (owner: 10Ottomata) [19:09:33] (03CR) 10Ottomata: [V: 03+2 C: 03+2] 2020.02~wmf7 - Improve env vars setting during activate and deactivate [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/762514 
(https://phabricator.wikimedia.org/T292699) (owner: 10Ottomata) [19:12:52] (03PS3) 10Andrew Bogott: dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) [19:13:00] (03CR) 10Jbond: etcd: Use cfssl for peer-to-peer communication (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [19:13:31] (03CR) 10jerkins-bot: [V: 04-1] dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) (owner: 10Andrew Bogott) [19:13:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:50] (03PS4) 10Andrew Bogott: dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) [19:17:26] (03PS1) 10Ottomata: Split package into 2 different packages: anaconda-wmf-base and anaconda-wmf [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/762517 [19:17:30] (03PS1) 10Ebernhardson: enable-puppet: Tell user about failure [puppet] - 10https://gerrit.wikimedia.org/r/762518 [19:17:37] (03CR) 10jerkins-bot: [V: 04-1] dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) (owner: 10Andrew Bogott) [19:18:47] (03PS5) 10Andrew Bogott: dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) [19:22:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P20739 and previous config saved to /var/cache/conftool/dbconfig/20220214-192227-marostegui.json 
[19:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:53] (03CR) 10EllenR: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762490 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [19:29:01] (03PS1) 10Ssingh: durum: add support for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/762521 (https://phabricator.wikimedia.org/T301165) [19:29:27] RECOVERY - puppet last run on wcqs2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:30:21] RECOVERY - puppet last run on wcqs2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:31:52] (03CR) 10Ottomata: [C: 03+2] Split package into 2 different packages: anaconda-wmf-base and anaconda-wmf [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/762517 (owner: 10Ottomata) [19:31:54] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Split package into 2 different packages: anaconda-wmf-base and anaconda-wmf [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/762517 (owner: 10Ottomata) [19:32:54] (03PS1) 10Ottomata: Fix env_vars - should be CXXFLAGS, not CXX_FLAGS [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/762522 [19:34:13] RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:32] !log prometheus2006 systemctl reset-failed [19:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:35] RECOVERY - puppet last run on wcqs2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:37:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 
(T300662)', diff saved to https://phabricator.wikimedia.org/P20740 and previous config saved to /var/cache/conftool/dbconfig/20220214-193732-marostegui.json [19:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:38] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [19:39:42] (03PS6) 10Andrew Bogott: dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) [19:39:44] (03PS1) 10Andrew Bogott: profile::wmcs::proxy::static: include sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/762524 [19:41:52] 10SRE, 10Security-Team, 10Performance-Team (Radar), 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10dpifke) Some questions: - What is the process by which these files are uploaded/updated? How often are they updated? - What is accessing them? (MediaWiki?) - What... 
[19:42:07] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::proxy::static: include sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/762524 (owner: 10Andrew Bogott) [19:42:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298554)', diff saved to https://phabricator.wikimedia.org/P20741 and previous config saved to /var/cache/conftool/dbconfig/20220214-194206-ladsgroup.json [19:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:04] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [19:43:10] (03Abandoned) 10Ebernhardson: Provide jwt secret to blazegraph for logging [puppet] - 10https://gerrit.wikimedia.org/r/761075 (https://phabricator.wikimedia.org/T293462) (owner: 10Ebernhardson) [19:44:12] (03PS2) 10Ssingh: durum: add support for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/762521 (https://phabricator.wikimedia.org/T301165) [19:45:03] (03PS2) 10Andrew Bogott: profile::wmcs::proxy::static: include sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/762524 [19:45:05] (03PS7) 10Andrew Bogott: dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) [19:46:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33788/console" [puppet] - 10https://gerrit.wikimedia.org/r/762521 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [19:48:25] (03PS1) 10Ebernhardson: wcqs: Provide access token secret to blazegraph logging [puppet] - 10https://gerrit.wikimedia.org/r/762527 [19:48:47] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::proxy::static: include sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/762524 (owner: 10Andrew Bogott) [19:57:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P20742 and previous config saved to /var/cache/conftool/dbconfig/20220214-195711-ladsgroup.json [19:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:11] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:06:55] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10bcampbell) [20:08:11] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) 05Stalled→03Open Hey @Dzahn and @akosiaris I'm working Kristie Robinson from Advanceme... [20:09:27] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [20:12:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P20743 and previous config saved to /var/cache/conftool/dbconfig/20220214-201215-ladsgroup.json [20:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:41] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) >>! In T297915#7634170, @bcampbell wrote: > Hey @Dzahn can you please remove wikimania as an a... 
[20:24:20] !log mx/exim: removing wikimania@wikimedia.org email alias (OTRS -> ITS) (T297915) [20:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:26] T297915: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 [20:26:59] !log mx/exim: removing donate@wikimedia.org email alias (OTRS -> ITS) - was alias for fundraising@ (T297915) [20:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298554)', diff saved to https://phabricator.wikimedia.org/P20744 and previous config saved to /var/cache/conftool/dbconfig/20220214-202720-ladsgroup.json [20:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:26] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [20:29:13] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) @bcampbell both wikimania@ and donate@ have been removed on our side. I can see both should be... [20:32:25] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) Hey @Dzahn my apologies, but I discovered there is one more issue that needs to be resolve... 
[20:33:45] !log mx/exim: re-adding donate@wikimedia.org email alias (OTRS -> ITS) (T297915) [20:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:50] T297915: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 [20:34:41] (03PS1) 10Accraze: ml-services: update editquality predictor image [deployment-charts] - 10https://gerrit.wikimedia.org/r/762532 (https://phabricator.wikimedia.org/T301415) [20:34:43] (03PS1) 10Accraze: ml-services: add arwiki & bnwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/762533 (https://phabricator.wikimedia.org/T301415) [20:36:51] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) Hey @bcampbell no problem at all. donate@ is activated again. I just noticed something mysel... [20:39:51] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) Thanks again @Dzahn. I'll circle back with Advancement and keep you updated. [20:40:19] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) [20:40:30] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) Sounds good, thank you! 
[20:43:24] (03PS1) 10Jbond: R:tlsproxy::localssl: Add cfssl support to tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/762535 [20:46:13] (03PS2) 10Jbond: R:tlsproxy::localssl: Add cfssl support to tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/762535 [20:48:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10Jclark-ctr) [20:52:59] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) [20:53:45] 10SRE, 10ops-codfw, 10DBA: codfw: db2136: Correctable memory error rate exceeded for DIMM_B5 - https://phabricator.wikimedia.org/T301713 (10Marostegui) Thanks Papaul. What would you like to do with this task? Close it an reopen if it happens again or leave it open for a few days to see how it goes? Thanks f... [20:53:54] (03PS1) 10Dzahn: rancid: rm .placeholder, ensure config dir exist, avoid puppet flap [puppet] - 10https://gerrit.wikimedia.org/r/762536 (https://phabricator.wikimedia.org/T211459) [20:54:06] 10SRE, 10ops-codfw, 10DBA: codfw: db2136: Correctable memory error rate exceeded for DIMM_B5 - https://phabricator.wikimedia.org/T301713 (10Marostegui) Oh, didn't realise you already closed it :-). 
Thanks again [20:55:03] 10SRE, 10ops-codfw, 10DBA: codfw: db2136: Correctable memory error rate exceeded for DIMM_B5 - https://phabricator.wikimedia.org/T301713 (10Papaul) @Marostegui re-open if we see the problem on A5 [20:55:31] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 51.42 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:56:05] 10SRE, 10Observability-Alerting, 10Patch-For-Review: rancid causes puppet to flap on netmon1002 - https://phabricator.wikimedia.org/T211459 (10Dzahn) [20:56:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [20:56:27] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33790/netmon1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/762536 (https://phabricator.wikimedia.org/T211459) (owner: 10Dzahn) [20:57:01] (03PS3) 10Jbond: R:tlsproxy::localssl: Add cfssl support to tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/762535 [20:58:54] (03PS4) 10Jbond: R:tlsproxy::localssl: Add cfssl support to tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/762535 [20:59:55] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 92.91 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:00:05] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220214T2100). 
[21:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220214T2100). [21:00:05] cjming and eigyan: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:12] I can deploy today [21:00:27] o/ [21:00:35] (unless cjming wishes to self-deploy?) [21:00:47] go right ahead urbanecm [21:00:51] (03CR) 10RLazarus: [C: 04-1] "Thanks for doing this! Appreciate the patch, one implementation comment only." [puppet] - 10https://gerrit.wikimedia.org/r/762518 (owner: 10Ebernhardson) [21:00:56] greetings everyone [21:01:49] hello eigyan [21:02:03] (03PS5) 10Jbond: R:tlsproxy::localssl: Add cfssl support to tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/762535 [21:02:17] (03PS3) 10Urbanecm: Update Beta cluster configuration for new and existing accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761772 (https://phabricator.wikimedia.org/T301166) (owner: 10Clare Ming) [21:02:24] (03CR) 10Urbanecm: [C: 03+2] Update Beta cluster configuration for new and existing accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761772 (https://phabricator.wikimedia.org/T301166) (owner: 10Clare Ming) [21:02:37] cjming: oh, i just saw it's a beta patch [21:02:40] well, then it's easy :) [21:02:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33794/console" [puppet] - 10https://gerrit.wikimedia.org/r/762535 (owner: 10Jbond) [21:02:58] will be live in ~30 minutes [21:03:17] urbanecm: so for beta, is it the same process? 
[21:03:26] (03Merged) 10jenkins-bot: Update Beta cluster configuration for new and existing accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761772 (https://phabricator.wikimedia.org/T301166) (owner: 10Clare Ming) [21:03:58] cjming: almost. You +2 it as normal, SSH to the deployment host, fetch (and rebase) it there, but as long as you don't change any non-labs files, you don't need to sync anything [21:04:11] (03CR) 10Jbond: [V: 03+1 C: 04-1] "this has a bug currently" [puppet] - 10https://gerrit.wikimedia.org/r/762535 (owner: 10Jbond) [21:04:15] urbanecm: gtk - thanks [21:05:13] Could someone with wmf-nda check wether I am subscribed to T291439? [21:05:28] cjming: no problem. Once you +2 it, jenkins will automatically deploy it to beta. It should be there within 30 minutes [21:06:01] (sometimes the beta update process gets stuck; if that happens, feel free to shout for help :)) [21:06:29] (03CR) 10Urbanecm: [C: 03+2] wmf-config: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762490 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [21:06:50] eigyan: let's do your patch now: I'll ping you once it's ready for testing at mwdebug [21:07:17] (03Merged) 10jenkins-bot: wmf-config: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762490 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [21:07:28] excellent thank you urbanecm [21:07:39] eigyan: your patch is at mwebug1001 [21:07:42] can you have a look? [21:07:59] RECOVERY - SSH on wtp1027.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:08:16] zabe: well, that task's invisible even for me (having both #wmf-nda and #acl*security) [21:08:51] huh [21:09:06] note that subscribing to task doesn't necessarily grant access. It depends on the access policy. 
[21:09:30] (03PS2) 10Ebernhardson: enable-puppet: Tell user about failure [puppet] - 10https://gerrit.wikimedia.org/r/762518 [21:09:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:09:32] (03CR) 10Ebernhardson: enable-puppet: Tell user about failure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762518 (owner: 10Ebernhardson) [21:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:57] I can't access it either, but I got a phab notification for that task which then went away and now my notifications are bugged. [21:10:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:10:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:36] I guess it was public and now it isn't and from the name of the task I guess that I subscribed because I did not understand what it was about and I was curious [21:11:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:20] 10SRE, 10Security-Team, 10Performance-Team (Radar), 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10sbassett) Hey @dpifke - I'll try to answer some of these, though @Stran or @Mstyles may have more/better insights: > - What is the process by which these files are upl... 
[21:15:29] eigyan: let me know how the testing's going :) [21:15:40] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply on main [21:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:18:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:53] (03CR) 10Andrew Bogott: [C: 03+2] dynmamicproxy: move db backups off of nfs and onto a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/762509 (https://phabricator.wikimedia.org/T301715) (owner: 10Andrew Bogott) [21:19:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:57] (03PS4) 10Dzahn: site: remove etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/761662 (https://phabricator.wikimedia.org/T300568) [21:22:47] (03CR) 10Dzahn: [C: 03+2] site: remove etherpad1002 [puppet] - 10https://gerrit.wikimedia.org/r/761662 (https://phabricator.wikimedia.org/T300568) (owner: 10Dzahn) [21:23:58] eigyan: how is the testing going? 🙂 [21:24:49] I should have some feedback shortly urbanecm [21:25:18] sure, taking your time. Just making sure you're on it eigyan. 
[21:25:21] *take [21:25:48] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: sync on main [21:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:07] 💯 [21:26:34] eigyan: so everything works? [21:29:27] eigyan: ^^ [21:31:07] (03PS1) 10Ebernhardson: ebernhardson: Update home dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/762542 [21:32:04] (03CR) 10Dzahn: [C: 03+2] ebernhardson: Update home dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/762542 (owner: 10Ebernhardson) [21:32:26] eigyan: can you confirm the patch works? Sorry, I'm not good at reading emojis and I prefer to be double sure before taking an action. [21:32:42] (03CR) 10Dzahn: "give it half an hour and it should have changed globally an all hosts" [puppet] - 10https://gerrit.wikimedia.org/r/762542 (owner: 10Ebernhardson) [21:36:12] urbanecm I am not seeing my change yet but am still looking [21:36:44] i see. [21:40:11] urbanecm I am not seeing any surveys loading at all for fawiki [21:41:49] it's definitely set in the configuration https://www.irccloud.com/pastebin/q49IUSMi/ [21:42:34] eigyan: but since you're (we're) unable to find a solution within almost 40 minutes, I prefer reverting it and figuring it out later. [21:42:54] you might want to set it first in -labs for fawiki labs and test there more extensively [21:43:11] "Some ad-blocking and no-script software blocks this extension." 
[21:43:19] (03PS1) 10Andrew Bogott: nfs-mounts.yaml.erb: remove nfs mounts for project-proxy [puppet] - 10https://gerrit.wikimedia.org/r/762543 (https://phabricator.wikimedia.org/T301715) [21:44:13] HMM I would rather not revert the change as it works in labs and we have a team ready to test in farsi -> so I would want to be premature in aborting [21:44:44] *would not want to be [21:45:41] urbanecm it works in labs btw [21:45:47] well, it clearly doesn't work in production 🙂 [21:45:57] lol [21:45:58] wow [21:46:00] ok [21:46:08] np do what you feel is best [21:47:09] where would you see whether it works? [21:47:27] mutante: i use I'm a bit reluctant to deploy something that doesn't appear to have the intended effect. Granted [21:47:28] I can see the extension is installed on fawiki, but then what else? [21:47:32] * https://fa.wikipedia.org/wiki/?quicksurvey=internal-gdi-safety-survey&safemode=1&useskin=vector [21:49:01] eigyan: to be honest I'm a bit reluctant to deploy something that doesn't appear to have the intended effect [21:49:23] hmm. so https://fa.wikipedia.beta.wmflabs.org/wiki/?quicksurvey=internal-gdi-safety-survey&safemode=1&useskin=vector ? [21:49:32] yeah, and that doesn't work on my end either [21:49:44] eigyan: why do you think it works on labs? :))) [21:50:08] those 2 pages look different but I can't tell if the survey works [21:50:41] np thanks guys [21:50:47] signing off [21:51:27] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:51:29] eigyan: I would recommend to revert if it's unknown why it does not work [21:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:55] They left [21:51:57] ..... 
[21:52:03] reverting [21:52:11] (03PS1) 10Urbanecm: Revert "wmf-config: Deploy the fawiki test safety survey to production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761764 (https://phabricator.wikimedia.org/T297629) [21:52:24] (03CR) 10Urbanecm: [C: 03+2] Revert "wmf-config: Deploy the fawiki test safety survey to production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761764 (https://phabricator.wikimedia.org/T297629) (owner: 10Urbanecm) [21:52:33] ... [21:53:07] (03Merged) 10jenkins-bot: Revert "wmf-config: Deploy the fawiki test safety survey to production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761764 (https://phabricator.wikimedia.org/T297629) (owner: 10Urbanecm) [21:53:12] (03PS4) 10Ssingh: Add Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) [21:58:37] (03CR) 10Ssingh: [C: 04-1] "This CR needs more discussion but I am keeping it updated in case the current solution is acceptable or we can build upon." [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [21:59:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:05] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220214T2200). 
[22:01:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:01:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [22:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:52] SRE, ops-codfw, decommission-hardware: decommission auth2001.codfw.wmnet - https://phabricator.wikimedia.org/T301546 (Papaul) [22:13:19] SRE, ops-codfw, decommission-hardware: decommission auth2001.codfw.wmnet - https://phabricator.wikimedia.org/T301546 (Papaul) Open→Resolved complete [22:13:22] SRE, ops-codfw, decommission-hardware: decommission prometheus2003.codfw.wmnet - https://phabricator.wikimedia.org/T301465 (Papaul) [22:13:51] SRE, ops-codfw, decommission-hardware: decommission prometheus2003.codfw.wmnet - https://phabricator.wikimedia.org/T301465 (Papaul) Open→Resolved complete [22:14:29] SRE, SRE-swift-storage, ops-codfw, decommission-hardware: Decommission ms-fe200[5-8].codfw.wmnet - https://phabricator.wikimedia.org/T301334 (Papaul) [22:14:45] SRE, SRE-swift-storage, ops-codfw, decommission-hardware: Decommission ms-fe200[5-8].codfw.wmnet - https://phabricator.wikimedia.org/T301334 (Papaul) Open→Resolved complete [23:07:47] SRE, ops-codfw, DBA: codfw: db2136: Correctable memory error rate exceeded for DIMM_B5 - https://phabricator.wikimedia.org/T301713 (Ladsgroup) I started mysql and waiting for it to catch up with
replication, once done, I repool it. [23:29:36] (PS1) Ladsgroup: db-production: Stop writes to es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/762557 (https://phabricator.wikimedia.org/T300976) [23:32:11] (CR) Ladsgroup: "root@es1025.eqiad.wmnet[fawiki]> show tables;" [mediawiki-config] - https://gerrit.wikimedia.org/r/762557 (https://phabricator.wikimedia.org/T300976) (owner: Ladsgroup) [23:41:05] (PS1) Ladsgroup: mariadb: Promote es1023 to es5 master [puppet] - https://gerrit.wikimedia.org/r/762558 (https://phabricator.wikimedia.org/T300976)