[00:02:17] PROBLEM - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:05:25] (PS3) Dylsss: Dumps: Clarify licensing for Wikidata and update various links [puppet] - https://gerrit.wikimedia.org/r/730243 (https://phabricator.wikimedia.org/T279436)
[00:16:43] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[00:17:40] (CR) Krinkle: Revert "Revert "static.php: Add support for /static/current rewrites"" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/730182 (owner: Giuseppe Lavagetto)
[00:18:10] (PS3) Krinkle: Revert "Revert "static.php: Add support for /static/current rewrites"" [mediawiki-config] - https://gerrit.wikimedia.org/r/730182 (owner: Giuseppe Lavagetto)
[00:18:47] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[00:19:00] (PS4) Krinkle: static.php: Add support for /static/current rewrites (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/730182 (https://phabricator.wikimedia.org/T285232) (owner: Giuseppe Lavagetto)
[00:20:56] (CR) Krinkle: "TODO for later: Figure out why Phan isn't running here (which would have trivially caught this in CI)."
[mediawiki-config] - https://gerrit.wikimedia.org/r/730182 (https://phabricator.wikimedia.org/T285232) (owner: Giuseppe Lavagetto)
[00:55:17] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:03:13] RECOVERY - SSH on gerrit2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:10:09] (PS1) Andrew Bogott: profile::manifests::toolforge::disable_tool: fix password/dn mixup [puppet] - https://gerrit.wikimedia.org/r/730387 (https://phabricator.wikimedia.org/T170355)
[01:10:11] (PS1) Andrew Bogott: disable_tool.conf.erb: add some additional entries for tool archiving [puppet] - https://gerrit.wikimedia.org/r/730388 (https://phabricator.wikimedia.org/T170355)
[01:14:13] (CR) Andrew Bogott: [C: +2] profile::manifests::toolforge::disable_tool: fix password/dn mixup [puppet] - https://gerrit.wikimedia.org/r/730387 (https://phabricator.wikimedia.org/T170355) (owner: Andrew Bogott)
[01:14:21] (CR) Andrew Bogott: [C: +2] disable_tool.conf.erb: add some additional entries for tool archiving [puppet] - https://gerrit.wikimedia.org/r/730388 (https://phabricator.wikimedia.org/T170355) (owner: Andrew Bogott)
[02:07:19] (CR) Gergő Tisza: [C: +1] updatementeedata.pp: Update script parameters [puppet] - https://gerrit.wikimedia.org/r/728656 (https://phabricator.wikimedia.org/T290609) (owner: Urbanecm)
[02:56:57] (PS1) Andrew Bogott: disable_tool.conf: remove some spurious quotemarks [puppet] - https://gerrit.wikimedia.org/r/730393 (https://phabricator.wikimedia.org/T170355)
[02:57:13] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:58:59] (CR) Andrew Bogott: [C: +2] disable_tool.conf: remove some spurious
quotemarks [puppet] - https://gerrit.wikimedia.org/r/730393 (https://phabricator.wikimedia.org/T170355) (owner: Andrew Bogott)
[04:06:57] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:12] (PS1) Gergő Tisza: Add Link: Do not log "no suggestion found" errors in production log [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - https://gerrit.wikimedia.org/r/730370 (https://phabricator.wikimedia.org/T291251)
[06:17:50] SRE, User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (elukey) Checked a little bit more the top 10 high traffic topics, and `eqiad.change-prop.transcludes.resource-change` (4th) still runs with a single partition with 500 msg/s. I would i...
[06:24:23] SRE, User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (elukey) ` kafka topics --alter --topic eqiad.change-prop.transcludes.resource-change --partitions 3 kafka topics --alter --topic codfw.change-prop.transcludes.resource-change --partiti...
[06:26:30] !log `kafka topics --alter --topic {eqiad,codfw}.change-prop.transcludes.resource-change --partitions 3` on kafka-main2001 - T288825
[06:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:40] T288825: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825
[06:58:17] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:58:31] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:05:49] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:11:19] (CR) Muehlenhoff: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/730247 (owner: Volans)
[07:11:47] (PS1) Kosta Harlan: Suggested Edits: Update local config.presets when topics/difficulty presets change [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - https://gerrit.wikimedia.org/r/730371 (https://phabricator.wikimedia.org/T292536)
[07:11:56] <_joe_> jouncebot: next
[07:11:56] In 3 hour(s) and 48 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1100)
[07:12:09] (CR) Filippo Giunchedi: graphite: disable tags support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/729968 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi)
[07:12:32] (CR) Filippo Giunchedi: [V: +1 C: +2] thanos: deploy global alerts only on hosts running thanos-rule [puppet] - https://gerrit.wikimedia.org/r/730206 (https://phabricator.wikimedia.org/T288726) (owner: Filippo
Giunchedi)
[07:12:36] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: actually pass requests for /static/current to static.php [deployment-charts] - https://gerrit.wikimedia.org/r/730212 (https://phabricator.wikimedia.org/T285232) (owner: Giuseppe Lavagetto)
[07:13:52] !log provision new eqsin-ulsfo link - T273308
[07:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:17] (PS1) Filippo Giunchedi: thanos: remove explicit require to avoid loops [puppet] - https://gerrit.wikimedia.org/r/730406
[07:18:29] (PS6) Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835)
[07:18:31] (Merged) jenkins-bot: mediawiki: actually pass requests for /static/current to static.php [deployment-charts] - https://gerrit.wikimedia.org/r/730212 (https://phabricator.wikimedia.org/T285232) (owner: Giuseppe Lavagetto)
[07:18:57] (PS2) Elukey: Set lvs_setup status for the inference service [puppet] - https://gerrit.wikimedia.org/r/720009 (https://phabricator.wikimedia.org/T289835)
[07:25:45] SRE, User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (elukey) In main-eqiad the starting point can be the high traffic topics: ` kafka topics --alter --topic eqiad.resource-purge --partitions 5 kafka topics --alter --topic codfw.resource...
[07:33:41] !log increase kafka topic partition size of the top 4 high traffic topics of main-eqiad as described in https://phabricator.wikimedia.org/T288825#7422726
[07:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:28] (PS2) Filippo Giunchedi: thanos: use include not require [puppet] - https://gerrit.wikimedia.org/r/730406
[07:38:30] (PS1) Filippo Giunchedi: pontoon: add settings for thanos rule_hosts [puppet] - https://gerrit.wikimedia.org/r/730407
[07:40:07] (CR) Filippo Giunchedi: [C: +2] pontoon: add settings for thanos rule_hosts [puppet] - https://gerrit.wikimedia.org/r/730407 (owner: Filippo Giunchedi)
[07:40:16] (CR) Filippo Giunchedi: [C: +2] thanos: use include not require [puppet] - https://gerrit.wikimedia.org/r/730406 (owner: Filippo Giunchedi)
[07:46:59] (CR) Giuseppe Lavagetto: [C: -1] role::ml_k8s::worker: add LVS configuration for the inference svc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[07:47:33] SRE, User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (elukey) The codfw.resource-purge topic got expanded to: ` Topic:codfw.resource-purge PartitionCount:5 ReplicationFactor:3 Configs: Topic: codfw.resource-purge...
[07:47:51] (CR) David Caro: remote: use only the last line for the uptime (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/730270 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[07:53:13] (PS5) Muehlenhoff: Move swiftrepl to a Hiera option and obsolete role::swift::swiftrepl [puppet] - https://gerrit.wikimedia.org/r/728378
[07:56:29] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[07:58:31] (CR) JMeybohm: [C: +1] hiera::deployment_server add missing mathoid helm3 deploy user [puppet] - https://gerrit.wikimedia.org/r/730199 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[08:00:04] (PS4) David Caro: base::sysctl::core_dumps: move core_dumps to their own class [puppet] - https://gerrit.wikimedia.org/r/728457
[08:00:08] (CR) David Caro: base::sysctl::core_dumps: move core_dumps to their own class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/728457 (owner: David Caro)
[08:06:49] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:08:57] (CR) Filippo Giunchedi: [V: +1 C: +2] graphite: move production to /srv/carbon as storage directory [puppet] - https://gerrit.wikimedia.org/r/729975 (https://phabricator.wikimedia.org/T247963) (owner: Filippo Giunchedi)
[08:09:04] (PS2) Filippo Giunchedi: graphite: move production to /srv/carbon as storage directory [puppet] - https://gerrit.wikimedia.org/r/729975 (https://phabricator.wikimedia.org/T247963)
[08:10:56] (PS4) Volans: dhcp: add support for MAC address based config [software/spicerack] - https://gerrit.wikimedia.org/r/730030 (https://phabricator.wikimedia.org/T269855)
[08:10:58] SRE, User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825
(elukey) Adjustments for main-eqiad: ` Topic:eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite PartitionCount:5 ReplicationFactor:3 Configs:...
[08:14:23] (CR) Ema: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31636/console" [puppet] - https://gerrit.wikimedia.org/r/730363 (https://phabricator.wikimedia.org/T293157) (owner: BryanDavis)
[08:15:05] !log bounce graphite on graphite1004 to apply new config
[08:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:23] PROBLEM - Host doh5002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:17:41] SRE, User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (elukey) ` Topic:eqiad.mediawiki.job.cirrusSearchElasticaWrite PartitionCount:5 ReplicationFactor:3 Configs: Topic: eqiad.mediawiki.job.cirrusSearchElasticaWrite...
[08:17:53] RECOVERY - Host doh5002 is UP: PING OK - Packet loss = 0%, RTA = 242.22 ms
[08:18:17] (PS1) Kormat: debian: Raise the open files limit [debs/orchestrator] - https://gerrit.wikimedia.org/r/730413
[08:21:20] (CR) Kormat: [V: +2 C: +2] debian: Raise the open files limit [debs/orchestrator] - https://gerrit.wikimedia.org/r/730413 (owner: Kormat)
[08:21:38] !log run kafka preferred-replica-election on kafka-main1001 to rebalance partition leaders - T288825
[08:21:39] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:43] T288825: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825
[08:21:48] (CR) Ema: [V: +1 C: +2] cache: Allow PATCH method to be passed to backend services [puppet] - https://gerrit.wikimedia.org/r/730363
(https://phabricator.wikimedia.org/T293157) (owner: BryanDavis)
[08:22:11] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:24:11] (CR) Jelto: [C: +2] hiera::deployment_server add missing mathoid helm3 deploy user [puppet] - https://gerrit.wikimedia.org/r/730199 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[08:26:33] (PS1) Ema: varnish: test allowed HTTP methods [puppet] - https://gerrit.wikimedia.org/r/730415 (https://phabricator.wikimedia.org/T293157)
[08:27:07] SRE, User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (elukey) Two little fine-tune changes for main-codfw topics (inspired from the work done on main-eqiad): ` Topic:codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite...
[08:28:13] SRE, Infrastructure-Foundations, CAS-SSO: Update to CAS 6.4 - https://phabricator.wikimedia.org/T293186 (MoritzMuehlenhoff)
[08:28:57] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:02] (CR) Muehlenhoff: remote: use only the last line for the uptime (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/730270 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[08:31:07] (PS1) Volans: install_server: uniform DHCP snippet automation [puppet] - https://gerrit.wikimedia.org/r/730416 (https://phabricator.wikimedia.org/T269855)
[08:32:59] (CR) Volans: "Related Puppet patch is: I2f5b85933fecc768673c0d01c6317173b067f194" [software/spicerack] - https://gerrit.wikimedia.org/r/730030 (https://phabricator.wikimedia.org/T269855) (owner: Volans)
[08:33:40] SRE, User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (elukey) Next steps: * come up with a topic-mappr plan for main-eqiad as done in T288825#7281471, excluding the aforementioned top 3 topics already running with 5 partitions.
[08:33:47] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:33:51] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 145 probes of 707 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:34:17] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:34:56] godog: I see we have icinga-wm_, is there any way to get it back to icinga-wm without restarting it? or should I just force a restart?
[08:35:13] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:30] (CR) Volans: "Compiler results:" [puppet] - https://gerrit.wikimedia.org/r/730416 (https://phabricator.wikimedia.org/T269855) (owner: Volans)
[08:35:59] volans: afaik a restart is needed
[08:36:01] PROBLEM - Host cp5016 is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:03] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:03] PROBLEM - Host cp5010 is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:07] PROBLEM - Host cp5008 is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:07] PROBLEM - Host cp5009 is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:07] PROBLEM - Host cp5011 is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:11] PROBLEM - Host dns5001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:24] XioNoX: did we just lost eqsin? need to depool?
[08:36:29] PROBLEM - Host ncredir5002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:35] RECOVERY - Host cp5009 is UP: PING WARNING - Packet loss = 71%, RTA = 241.10 ms
[08:36:37] RECOVERY - Host cp5010 is UP: PING OK - Packet loss = 0%, RTA = 242.93 ms
[08:36:37] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 242.70 ms
[08:36:37] RECOVERY - Host cp5008 is UP: PING OK - Packet loss = 0%, RTA = 242.95 ms
[08:36:37] RECOVERY - Host cp5011 is UP: PING OK - Packet loss = 0%, RTA = 242.40 ms
[08:36:37] RECOVERY - Host cp5016 is UP: PING OK - Packet loss = 0%, RTA = 242.78 ms
[08:36:37] RECOVERY - Host dns5001 is UP: PING OK - Packet loss = 0%, RTA = 242.49 ms
[08:36:41] RECOVERY - Host ncredir5002 is UP: PING OK - Packet loss = 0%, RTA = 241.68 ms
[08:37:10] I can ssh there, checking graphs, could have been just the OSPF flip above
[08:37:53] I just stepped away from my laptop
[08:39:53] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 20 probes of 707 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:40:10] I don't see any maintenance notification but seems to match a telia transport failure
[08:40:58] (PS1) Elukey: custom.deploy.d: change the ml-serve' istio ingress node port [deployment-charts] - https://gerrit.wikimedia.org/r/730419 (https://phabricator.wikimedia.org/T289835)
[08:41:25] (PS3) Filippo Giunchedi: TEMP profile: restart postgres on first install / bootstrap [puppet] - https://gerrit.wikimedia.org/r/705704
[08:41:27] (PS3) Filippo Giunchedi: TEMP? sslcert: additional search paths for certificates [puppet] - https://gerrit.wikimedia.org/r/716370
[08:41:29] (PS1) Filippo Giunchedi: TEMP: disable kafka cert checking [puppet] - https://gerrit.wikimedia.org/r/730420
[08:41:31] (PS1) Filippo Giunchedi: TEMP: set host-based service::catalog entries to status: lvs_setup to get icinga to work [puppet] - https://gerrit.wikimedia.org/r/730421
[08:41:33] (PS1) Filippo Giunchedi: TEMP: apply changes to get cumin to work [puppet] - https://gerrit.wikimedia.org/r/730422
[08:41:35] (PS1) Filippo Giunchedi: pontoon: enable sd for stack observability [puppet] - https://gerrit.wikimedia.org/r/730423
[08:41:37] (PS1) Filippo Giunchedi: TEMP: enable pontoon sd in base [puppet] - https://gerrit.wikimedia.org/r/730424
[08:41:39] (PS1) Filippo Giunchedi: TEMP: use localhost for carbon relay [puppet] - https://gerrit.wikimedia.org/r/730425
[08:41:41] (PS1) Filippo Giunchedi: TEMP: set localhost as graphite_host [puppet] - https://gerrit.wikimedia.org/r/730426
[08:41:43] oooff, sorry
[08:41:43] (PS1) Filippo Giunchedi: graphite: expire metric files not updated for 3y [puppet] - https://gerrit.wikimedia.org/r/730427 (https://phabricator.wikimedia.org/T247963)
[08:41:59] most of these reviews weren't meant to go out
[08:42:57] (Abandoned) Filippo Giunchedi: TEMP: set localhost as
graphite_host [puppet] - https://gerrit.wikimedia.org/r/730426 (owner: Filippo Giunchedi)
[08:43:03] (CR) ZPapierski: [C: +1] kafka: docstrings minor improvements [software/spicerack] - https://gerrit.wikimedia.org/r/730195 (owner: Volans)
[08:43:05] (Abandoned) Filippo Giunchedi: TEMP: use localhost for carbon relay [puppet] - https://gerrit.wikimedia.org/r/730425 (owner: Filippo Giunchedi)
[08:43:12] (Abandoned) Filippo Giunchedi: TEMP: enable pontoon sd in base [puppet] - https://gerrit.wikimedia.org/r/730424 (owner: Filippo Giunchedi)
[08:43:14] (CR) Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[08:43:17] (Abandoned) Filippo Giunchedi: pontoon: enable sd for stack observability [puppet] - https://gerrit.wikimedia.org/r/730423 (owner: Filippo Giunchedi)
[08:43:22] (Abandoned) Filippo Giunchedi: TEMP: apply changes to get cumin to work [puppet] - https://gerrit.wikimedia.org/r/730422 (owner: Filippo Giunchedi)
[08:43:29] (Abandoned) Filippo Giunchedi: TEMP: set host-based service::catalog entries to status: lvs_setup to get icinga to work [puppet] - https://gerrit.wikimedia.org/r/730421 (owner: Filippo Giunchedi)
[08:43:35] (Abandoned) Filippo Giunchedi: TEMP: disable kafka cert checking [puppet] - https://gerrit.wikimedia.org/r/730420 (owner: Filippo Giunchedi)
[08:43:40] (CR) David Caro: remote: use only the last line for the uptime (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/730270 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[08:45:38] (CR) Volans: [C: +2] kafka: docstrings minor improvements [software/spicerack] - https://gerrit.wikimedia.org/r/730195 (owner: Volans)
[08:48:21] (PS7) Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] -
https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835)
[08:48:23] (PS3) Elukey: Set lvs_setup status for the inference service [puppet] - https://gerrit.wikimedia.org/r/720009 (https://phabricator.wikimedia.org/T289835)
[08:48:57] (CR) Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc (2 comments) [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[08:49:38] volans: the outstanding dns changes are mine. I will push them when I'm back on my laptop
[08:50:00] XioNoX: ack
[08:53:10] (CR) Phuedx: [C: +1] "This LGTM. AIUI this is a NOP until the patch that depends on this is merged and rolls out on the train (likely next Thursday, 21st Octobe" [mediawiki-config] - https://gerrit.wikimedia.org/r/725161 (https://phabricator.wikimedia.org/T289622) (owner: Nray)
[08:53:33] (CR) Jbond: [C: +1] "LGTM, thx" [software/ecs] - https://gerrit.wikimedia.org/r/726226 (owner: Cwhite)
[08:54:01] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:54:10] (CR) Ema: [C: +2] varnish: test allowed HTTP methods [puppet] - https://gerrit.wikimedia.org/r/730415 (https://phabricator.wikimedia.org/T293157) (owner: Ema)
[08:55:06] (CR) Majavah: "minor nit in commit message" [puppet] - https://gerrit.wikimedia.org/r/728648 (owner: Dzahn)
[08:55:23] (Merged) jenkins-bot: kafka: docstrings minor improvements [software/spicerack] - https://gerrit.wikimedia.org/r/730195 (owner: Volans)
[08:55:25] (Merged) jenkins-bot: changelog: fix typo [software/spicerack] - https://gerrit.wikimedia.org/r/730198 (owner: Volans)
[08:56:02] (CR) Jbond: [C: +1] sre.ganeti.makevm: add Netbox sync on rollback [cookbooks] - https://gerrit.wikimedia.org/r/730247 (owner: Volans)
[09:02:24] (CR) Volans: [C: +2]
sre.ganeti.makevm: add Netbox sync on rollback [cookbooks] - https://gerrit.wikimedia.org/r/730247 (owner: Volans)
[09:02:37] SRE, Toolhub, Traffic, Patch-For-Review: Toolhub API requests with PATCH verbs blocked by CDN - https://phabricator.wikimedia.org/T293157 (ema) @bd808: looks like we're all set! ` $ curl -sv -X PATCH https://phabricator.wikimedia.org/T293157 2>&1 | grep '< HTTP/2' < HTTP/2 200 `
[09:03:27] (CR) Jobo: [C: +2] admin/otrs: create new root admin group vrts-admins, add Arnold [puppet] - https://gerrit.wikimedia.org/r/728648 (owner: Dzahn)
[09:05:42] (Merged) jenkins-bot: sre.ganeti.makevm: add Netbox sync on rollback [cookbooks] - https://gerrit.wikimedia.org/r/730247 (owner: Volans)
[09:08:16] (CR) Muehlenhoff: [C: +2] Move swiftrepl to a Hiera option and obsolete role::swift::swiftrepl [puppet] - https://gerrit.wikimedia.org/r/728378 (owner: Muehlenhoff)
[09:12:00] (CR) Btullis: [C: +2] Add the ecs_170 tag to the jupyterjab log pipeline [puppet] - https://gerrit.wikimedia.org/r/729957 (https://phabricator.wikimedia.org/T288348) (owner: Btullis)
[09:12:16] (CR) Jbond: [C: +1] "LGTM see minor comment" [software/spicerack] - https://gerrit.wikimedia.org/r/730030 (https://phabricator.wikimedia.org/T269855) (owner: Volans)
[09:13:26] SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (MunizaA) @CDanis I've signed it now. Thanks!
[09:14:22] (PS3) Vgutierrez: acme_chief: Add systemd based watchdog [software/acme-chief] - https://gerrit.wikimedia.org/r/728379 (https://phabricator.wikimedia.org/T292619)
[09:14:45] (CR) Volans: "reply inline" [software/spicerack] - https://gerrit.wikimedia.org/r/730030 (https://phabricator.wikimedia.org/T269855) (owner: Volans)
[09:16:38] SRE, SRE-tools, Infrastructure-Foundations, User-ema: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 (ema)
[09:17:06] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/728457 (owner: David Caro)
[09:18:38] (CR) jerkins-bot: [V: -1] acme_chief: Add systemd based watchdog [software/acme-chief] - https://gerrit.wikimedia.org/r/728379 (https://phabricator.wikimedia.org/T292619) (owner: Vgutierrez)
[09:19:14] (CR) Jbond: [C: +1] install_server: uniform DHCP snippet automation [puppet] - https://gerrit.wikimedia.org/r/730416 (https://phabricator.wikimedia.org/T269855) (owner: Volans)
[09:21:04] (CR) Jcrespo: [C: +2] "I am going to deploy this on cumin hosts, having gotten at least one dba approval, but please @Lsobanski, when Manuel is back from vacatio" [puppet] - https://gerrit.wikimedia.org/r/726857 (owner: Jcrespo)
[09:21:14] (CR) Jbond: [C: +1] dhcp: add support for MAC address based config (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/730030 (https://phabricator.wikimedia.org/T269855) (owner: Volans)
[09:21:57] (CR) Volans: [C: +2] dhcp: add support for MAC address based config [software/spicerack] - https://gerrit.wikimedia.org/r/730030 (https://phabricator.wikimedia.org/T269855) (owner: Volans)
[09:22:38] !log ema@cumin2002 START - Cookbook sre.hosts.reimage for host cp4021.ulsfo.wmnet with OS buster
[09:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:45] SRE, SRE-tools, Infrastructure-Foundations,
User-ema: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ema@cumin2002 for host cp4021.ulsfo.wmnet with OS buster
[09:24:55] SRE, Traffic, User-ema: Create runbook for VarnishTrafficDrop alert, change dashboard link - https://phabricator.wikimedia.org/T292820 (ema) Open→Resolved a:ema Runbook and updated dashboard link are shown correctly. Closing. ` 06:00 < jinxer-wm> (VarnishTrafficDrop) resolved: 68% GET dr...
[09:25:50] (PS4) Vgutierrez: acme_chief: Support systemd watchdog [puppet] - https://gerrit.wikimedia.org/r/730016 (https://phabricator.wikimedia.org/T292619)
[09:26:35] (CR) jerkins-bot: [V: -1] acme_chief: Support systemd watchdog [puppet] - https://gerrit.wikimedia.org/r/730016 (https://phabricator.wikimedia.org/T292619) (owner: Vgutierrez)
[09:27:38] (PS1) Muehlenhoff: Remove stray owner lines [puppet] - https://gerrit.wikimedia.org/r/730434
[09:27:51] (Merged) jenkins-bot: dhcp: add support for MAC address based config [software/spicerack] - https://gerrit.wikimedia.org/r/730030 (https://phabricator.wikimedia.org/T269855) (owner: Volans)
[09:29:05] (PS1) Jcrespo: db-kill: Fix wmfmariadbpy package dependency [puppet] - https://gerrit.wikimedia.org/r/730435
[09:29:42] (CR) Muehlenhoff: [C: +2] Remove stray owner lines [puppet] - https://gerrit.wikimedia.org/r/730434 (owner: Muehlenhoff)
[09:33:23] (PS6) Muehlenhoff: Obsolete role::restbase::base [puppet] - https://gerrit.wikimedia.org/r/729943
[09:36:22] (CR) Jcrespo: [C: +2] db-kill: Fix wmfmariadbpy package dependency [puppet] - https://gerrit.wikimedia.org/r/730435 (owner: Jcrespo)
[09:39:50] (PS1) Lucas Werkmeister (WMDE): Instantiate ItemId for SiteLinkConflictLookup results [extensions/Wikibase] (wmf/1.38.0-wmf.4) - https://gerrit.wikimedia.org/r/730380
(https://phabricator.wikimedia.org/T293104)
[09:40:03] (CR) Jcrespo: "Original patch errored when applied. Not super happy with this fix, but wanted to deploy something quickly to not have an error. Please he" [puppet] - https://gerrit.wikimedia.org/r/730435 (owner: Jcrespo)
[09:43:55] (PS1) Jgiannelos: tegola-vector-tiles: Allow access to kafka clusters [deployment-charts] - https://gerrit.wikimedia.org/r/730436
[09:46:19] (PS1) Lucas Werkmeister (WMDE): Instantiate ItemId for SiteLinkConflictLookup results [extensions/Wikibase] (wmf/1.38.0-wmf.3) - https://gerrit.wikimedia.org/r/730385 (https://phabricator.wikimedia.org/T293104)
[09:48:01] (CR) Elukey: [C: +2] custom.deploy.d: change the ml-serve' istio ingress node port [deployment-charts] - https://gerrit.wikimedia.org/r/730419 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[09:48:13] o/ I'm going to merge this Beta Cluster-only change on behalf the Trust & Safety Tools team: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/728490
[09:48:25] I'll pull it onto the deployment host once it's merged
[09:48:48] phuedx: I would suggest also syncing the file to the cluster
[09:49:02] so that people won't be surprised when it's their turn to sync
[09:49:10] joe: Can do :)
[09:49:15] but I'm not sure what the docs suggest to do nowadays :)
[09:49:18] (PS2) Jgiannelos: tegola-vector-tiles: Allow access to kafka clusters [deployment-charts] - https://gerrit.wikimedia.org/r/730436 (https://phabricator.wikimedia.org/T293134)
[09:49:25] It'll also have a line item in the SAL too
[09:49:41] that was the wisdom last time I made a labs-only change, which was probably 5 years ago :P
[09:50:59] (CR) Effie Mouzeli: [C: +1] tegola-vector-tiles: Allow access to kafka clusters [deployment-charts] - https://gerrit.wikimedia.org/r/730436 (https://phabricator.wikimedia.org/T293134) (owner: Jgiannelos)
[09:51:07] PROBLEM - k8s API server requests latencies on
kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:52:56] (03CR) 10Phuedx: [C: 03+2] Add more types of QuickSurveys on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728490 (https://phabricator.wikimedia.org/T292459) (owner: 10Jhernandez) [09:53:40] (03Merged) 10jenkins-bot: Add more types of QuickSurveys on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728490 (https://phabricator.wikimedia.org/T292459) (owner: 10Jhernandez) [09:54:30] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:55:55] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Allow access to kafka clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/730436 (https://phabricator.wikimedia.org/T293134) (owner: 10Jgiannelos) [09:59:31] (03CR) 10Muehlenhoff: "Looks great, two nits/suggestions inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/729970 (owner: 10Jbond) [09:59:54] (03Merged) 10jenkins-bot: tegola-vector-tiles: Allow access to kafka clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/730436 (https://phabricator.wikimedia.org/T293134) (owner: 10Jgiannelos) [10:01:56] (03PS1) 10Jbond: mediawiki: add get_primary_dc function [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 [10:02:01] Just verifying the change with a member of the team [10:02:05] joakino ^ [10:02:20] (03PS1) 10Muehlenhoff: Sync content of Hiera contact information and owners.yaml [puppet] - 10https://gerrit.wikimedia.org/r/730441 [10:02:33] Looking good phuedx [10:04:04] (03CR) 10Majavah: kubeadm: add helm-diff (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/729577 (https://phabricator.wikimedia.org/T292771) (owner: 10Majavah) [10:04:46] joakino and I have tested the change. syncing [10:06:49] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [10:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:53] !log phuedx@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:728490|Add more types of QuickSurveys on beta cluster (T292459)]] (duration: 01m 53s) [10:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:01] T292459: Deploy all types of surveys to beta cluster - https://phabricator.wikimedia.org/T292459 [10:10:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:57] (03PS1) 10MVernon: codfw-prod: more weight to ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/730442 (https://phabricator.wikimedia.org/T290881) [10:13:34] (03PS1) 10Kormat: wmfmariadbpy: Explicitly require the library package [puppet] - 10https://gerrit.wikimedia.org/r/730443 [10:13:36] (03CR) 10MVernon: "Same increment as on Monday :)" 
[software/swift-ring] - 10https://gerrit.wikimedia.org/r/730442 (https://phabricator.wikimedia.org/T290881) (owner: 10MVernon) [10:14:49] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31639/console" [puppet] - 10https://gerrit.wikimedia.org/r/730443 (owner: 10Kormat) [10:15:02] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [10:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:17] (03CR) 10Filippo Giunchedi: [C: 03+1] codfw-prod: more weight to ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/730442 (https://phabricator.wikimedia.org/T290881) (owner: 10MVernon) [10:23:20] (03PS5) 10Jbond: systemd::sysuser: refactor to provide some useful defaults [puppet] - 10https://gerrit.wikimedia.org/r/729970 [10:24:14] (03CR) 10Kormat: [V: 03+1 C: 03+2] wmfmariadbpy: Explicitly require the library package [puppet] - 10https://gerrit.wikimedia.org/r/730443 (owner: 10Kormat) [10:25:21] (03CR) 10MVernon: [V: 03+2 C: 03+2] codfw-prod: more weight to ms-be2045 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/730442 (https://phabricator.wikimedia.org/T290881) (owner: 10MVernon) [10:25:47] (03PS6) 10Jbond: systemd::sysuser: refactor to provide some useful defaults [puppet] - 10https://gerrit.wikimedia.org/r/729970 [10:26:04] (03PS4) 10Vgutierrez: acme_chief: Add systemd based watchdog support [software/acme-chief] - 10https://gerrit.wikimedia.org/r/728379 (https://phabricator.wikimedia.org/T292619) [10:28:15] (03PS5) 10Jbond: systemd::sysuser: add more error checking [puppet] - 10https://gerrit.wikimedia.org/r/729995 [10:28:26] (03PS3) 10Jbond: systemd::sysuser: also manage a user resource [puppet] - 10https://gerrit.wikimedia.org/r/730012 [10:28:38] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/729970 (owner: 10Jbond) [10:29:09] 
RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:30:10] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/730441 (owner: 10Muehlenhoff) [10:30:38] (03CR) 10jerkins-bot: [V: 04-1] systemd::sysuser: add more error checking [puppet] - 10https://gerrit.wikimedia.org/r/729995 (owner: 10Jbond) [10:31:02] (03CR) 10jerkins-bot: [V: 04-1] systemd::sysuser: also manage a user resource [puppet] - 10https://gerrit.wikimedia.org/r/730012 (owner: 10Jbond) [10:31:54] (03PS1) 10Filippo Giunchedi: wait for config before starting pg [puppet] - 10https://gerrit.wikimedia.org/r/730479 [10:32:21] (03CR) 10Ayounsi: [C: 03+1] install_server: uniform DHCP snippet automation [puppet] - 10https://gerrit.wikimedia.org/r/730416 (https://phabricator.wikimedia.org/T269855) (owner: 10Volans) [10:33:05] (03PS4) 10Filippo Giunchedi: profile: restart postgres on first install / bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/705704 [10:33:07] (03PS2) 10Filippo Giunchedi: wait for config before starting pg [puppet] - 10https://gerrit.wikimedia.org/r/730479 [10:33:35] (03CR) 10jerkins-bot: [V: 04-1] wait for config before starting pg [puppet] - 10https://gerrit.wikimedia.org/r/730479 (owner: 10Filippo Giunchedi) [10:33:41] (03PS4) 10Filippo Giunchedi: sslcert: additional search paths for certificates [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) [10:33:59] (03CR) 10Muehlenhoff: [C: 03+2] Sync content of Hiera contact information and owners.yaml [puppet] - 10https://gerrit.wikimedia.org/r/730441 (owner: 10Muehlenhoff) [10:34:46] (03PS1) 10Elukey: custom_deploy.d: change ml-serve's istio gateway https port [deployment-charts] - 10https://gerrit.wikimedia.org/r/730480 [10:37:58] 10SRE-tools, 10Infrastructure-Foundations: Netbox check: the Uncommitted DNS changes in Netbox should recover 
more quickly - https://phabricator.wikimedia.org/T293206 (10Volans) p:05Triage→03Medium [10:37:59] (03PS5) 10Vgutierrez: acme_chief: Support systemd watchdog [puppet] - 10https://gerrit.wikimedia.org/r/730016 (https://phabricator.wikimedia.org/T292619) [10:38:41] (03CR) 10Jhernandez: "As part of https://phabricator.wikimedia.org/T292459 we deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/728490/ wh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604681 (https://phabricator.wikimedia.org/T255130) (owner: 10Awight) [10:39:43] (03CR) 10Jhernandez: [C: 04-1] "As part of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/728490/ which tidies up the beta cluster config we also removed " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604895 (owner: 10Awight) [10:42:33] (03PS1) 10Jgiannelos: tile-pregeneration: Add snappy codec support for kafka [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730481 (https://phabricator.wikimedia.org/T293204) [10:43:09] (03CR) 10JMeybohm: kubeadm: add helm-diff (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/729577 (https://phabricator.wikimedia.org/T292771) (owner: 10Majavah) [10:44:48] (03CR) 10Jgiannelos: "Here is the relevant section from kafka-python documentation:" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730481 (https://phabricator.wikimedia.org/T293204) (owner: 10Jgiannelos) [10:46:12] (03CR) 10Hnowlan: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/729943 (owner: 10Muehlenhoff) [10:48:11] (03PS1) 10Muehlenhoff: Sync more content of Hiera contact information and owners.yaml [puppet] - 10https://gerrit.wikimedia.org/r/730482 [10:48:41] (03CR) 10Majavah: kubeadm: add helm-diff (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/729577 (https://phabricator.wikimedia.org/T292771) (owner: 10Majavah) [10:53:15] kostajh: are you around for the upcoming deployment window? 
[10:53:44] Lucas_WMDE: yeah [10:53:49] (03CR) 10Volans: "LGTM, minor nits inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 (owner: 10Jbond) [10:53:52] ok, then I’ll start +2ing changes already, to save some time [10:53:55] thanks! [10:54:01] (03CR) 10Muehlenhoff: "Looks good, one last comment inline." [puppet] - 10https://gerrit.wikimedia.org/r/729970 (owner: 10Jbond) [10:54:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Suggested Edits: Update local config.presets when topics/difficulty presets change [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730371 (https://phabricator.wikimedia.org/T292536) (owner: 10Kosta Harlan) [10:54:13] Lucas_WMDE: danke [10:56:09] (03PS1) 10Jcrespo: Revert "db-kill: Fix wmfmariadbpy package dependency" [puppet] - 10https://gerrit.wikimedia.org/r/730487 [10:59:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add Link: Do not log "no suggestion found" errors in production log [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730370 (https://phabricator.wikimedia.org/T291251) (owner: 10Gergő Tisza) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1100). [11:00:05] kostajh and Lucas_WMDE: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:08] (03CR) 10Jelto: Rename main cluster to services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [11:00:09] o/ [11:00:12] I can deploy [11:00:16] hello again :) [11:00:34] hi ;) [11:00:42] or do you want to self-serve? 
I forgot you’re also a deployer [11:01:03] (I’ll go ahead and +2 my Wikibase backports, those will take a while in CI) [11:01:07] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Instantiate ItemId for SiteLinkConflictLookup results [extensions/Wikibase] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730385 (https://phabricator.wikimedia.org/T293104) (owner: 10Lucas Werkmeister (WMDE)) [11:01:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Instantiate ItemId for SiteLinkConflictLookup results [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730380 (https://phabricator.wikimedia.org/T293104) (owner: 10Lucas Werkmeister (WMDE)) [11:02:03] 10SRE-tools, 10Infrastructure-Foundations: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10Volans) p:05Triage→03Medium [11:02:03] Lucas_WMDE: sure, I can do it [11:02:06] ok [11:02:10] then we just wait for CI for now [11:02:43] ⏳ [11:03:01] https://memegenerator.net/img/instances/65289046/waiting-for-jenkins-to-finish-build.jpg [11:03:11] lol [11:04:04] PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:04:25] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1006.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1013.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1006.eqiad.wmnet are marked down but pooled: [11:04:25] Servers wdqs1004.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:04:50] doh [11:05:11] * volans acked VO alert [11:05:17] 
pinged on -search too [11:05:48] !log ema@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4021.ulsfo.wmnet with OS buster [11:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-ema: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ema@cumin2002 for host cp4021.ulsfo.wmnet with OS buster completed: - cp... [11:06:06] gehel maybe? [11:06:47] (03CR) 10JMeybohm: kubeadm: add helm-diff (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/729577 (https://phabricator.wikimedia.org/T292771) (owner: 10Majavah) [11:06:58] RECOVERY - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:07:25] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:09:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kubeadm: add helm-diff [puppet] - 10https://gerrit.wikimedia.org/r/729577 (https://phabricator.wikimedia.org/T292771) (owner: 10Majavah) [11:11:07] (03CR) 10jerkins-bot: [V: 04-1] Suggested Edits: Update local config.presets when topics/difficulty presets change [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730371 (https://phabricator.wikimedia.org/T292536) (owner: 10Kosta Harlan) [11:11:17] T292729 is back :| [11:11:17] T292729: TAR_ENTRY_ERROR ENOSPC: no space left on device - https://phabricator.wikimedia.org/T292729 [11:12:29] (03CR) 10Kosta Harlan: [C: 03+2] "CI failure is T292729. I assume it will fail again, but +2 in case it somehow magically works." 
[extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730371 (https://phabricator.wikimedia.org/T292536) (owner: 10Kosta Harlan) [11:12:34] 10SRE, 10Commons, 10Datasets-Archiving, 10Datasets-General-or-Unknown, and 2 others: Back up of Commons files - https://phabricator.wikimedia.org/T160229 (10jcrespo) This is technically done, although waiting for a redundant copy on a geographically remote site before closing it. [11:13:09] kostajh: oh no :( [11:13:14] (03CR) 10Jbond: mediawiki: add get_primary_dc function (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/730440 (owner: 10Jbond) [11:13:34] (03PS1) 10Jcrespo: mediabackups: Backup enwikivoyage media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730483 (https://phabricator.wikimedia.org/T262668) [11:13:36] (03CR) 10Hnowlan: [C: 03+1] "one minor nit, lgtm though" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730481 (https://phabricator.wikimedia.org/T293204) (owner: 10Jgiannelos) [11:13:56] (03PS7) 10Jbond: systemd::sysuser: refactor to provide some useful defaults [puppet] - 10https://gerrit.wikimedia.org/r/729970 [11:14:05] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/729970 (owner: 10Jbond) [11:14:12] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/729970 (owner: 10Jbond) [11:14:17] is it okay if the patches are merged in the new order in Zuul – GrowthExperiments, then Wikibase, then the other GrowthExperiments change? 
[11:15:23] yes that's fine with me [11:15:28] ok [11:17:21] (03CR) 10Jbond: [C: 03+1] Sync more content of Hiera contact information and owners.yaml [puppet] - 10https://gerrit.wikimedia.org/r/730482 (owner: 10Muehlenhoff) [11:18:18] lol, the other GrowthExperiments one has a flaky selenium test failure from AbuseFilter [11:19:05] !log pool cp4021 after reimage T201317 [11:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:11] T201317: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 [11:19:24] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup enwikivoyage media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730483 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [11:20:45] you love to see it [11:21:05] let’s just abort the other builds so we can restart them quicker [11:21:52] (03CR) 10Kosta Harlan: [C: 03+2] "Previous build was going to fail due to T293211" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730370 (https://phabricator.wikimedia.org/T291251) (owner: 10Gergő Tisza) [11:22:29] (03CR) 10jerkins-bot: [V: 04-1] Add Link: Do not log "no suggestion found" errors in production log [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730370 (https://phabricator.wikimedia.org/T291251) (owner: 10Gergő Tisza) [11:22:45] (03CR) 10Lucas Werkmeister (WMDE): "." [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730370 (https://phabricator.wikimedia.org/T291251) (owner: 10Gergő Tisza) [11:23:06] (03PS1) 10Jbond: cookbook sre: update SREBatchBase/SREBatchRunnerBase with minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/730506 [11:23:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "." 
[extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730370 (https://phabricator.wikimedia.org/T291251) (owner: 10Gergő Tisza) [11:23:40] there we go, now it’s running again [11:23:58] (03PS6) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 [11:25:41] (03PS1) 10Inductiveload: Allow copy-upload (by URL) for Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730507 (https://phabricator.wikimedia.org/T293205) [11:27:38] (03CR) 10Inductiveload: "Enabling copy uploads at Wikisource tracked here: https://phabricator.wikimedia.org/T293205" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720058 (https://phabricator.wikimedia.org/T287241) (owner: 10Inductiveload) [11:31:25] (03CR) 10Jbond: interface: update rps script to also set the number of queues via ethtool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [11:32:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: add helm-diff [puppet] - 10https://gerrit.wikimedia.org/r/729577 (https://phabricator.wikimedia.org/T292771) (owner: 10Majavah) [11:33:22] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-ema: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 (10ema) Trying another reimage as follows: ` root@cumin2002:~# cookbook sre.hosts.reimage --os buster --conftool -t T201317 cp4021 2>&1 | ts |... 
[11:33:38] !log ema@cumin2002 START - Cookbook sre.hosts.reimage for host cp4021.ulsfo.wmnet with OS buster [11:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:44] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-ema: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ema@cumin2002 for host cp4021.ulsfo.wmnet with OS buster [11:34:07] the new GrowthExperiments builds haven’t even started yet :( [11:34:57] eek [11:35:55] 10SRE, 10Toolhub, 10Traffic, 10User-ema: Toolhub API requests with PATCH verbs blocked by CDN - https://phabricator.wikimedia.org/T293157 (10ema) [11:36:58] (03PS1) 10Majavah: kubeadm: fix profile.d/helm-config.sh source [puppet] - 10https://gerrit.wikimedia.org/r/730508 [11:37:38] Lucas_WMDE: kostajh: any objection in me doing T255037 now? [11:37:38] T255037: Deploy Growth features on Italian Wikipedia - https://phabricator.wikimedia.org/T255037 [11:37:56] sounds fine [11:38:01] Zuul says the first Wikibase backport is about to finish [11:38:03] let me look at the log [11:38:19] yeah that’s almost done [11:38:27] so I’d prefer to finish that first if that’s okay [11:39:37] sure [11:39:39] (03Merged) 10jenkins-bot: Instantiate ItemId for SiteLinkConflictLookup results [extensions/Wikibase] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730385 (https://phabricator.wikimedia.org/T293104) (owner: 10Lucas Werkmeister (WMDE)) [11:39:46] ping me when I'm free to go [11:40:05] will do, thanks [11:40:10] (03PS1) 10Majavah: d/control: Remove version from Depends: helm [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/730509 [11:40:26] (03PS1) 10Urbanecm: itwiki: Deploy Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730510 (https://phabricator.wikimedia.org/T255037) [11:40:39] my changes can’t be tested (only fixes an error in case of a fairly 
rare race condition) so I’ll just sync them directly [11:41:14] 10Puppet, 10Infrastructure-Foundations, 10Machine-Learning-Team, 10ORES: Write puppet for redis-sentinel - https://phabricator.wikimedia.org/T210580 (10jijiki) @Ladsgroup is this work still in progress or abandoned? [11:43:12] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/Wikibase/repo/: Backport: [[gerrit:730385|Instantiate ItemId for SiteLinkConflictLookup results (T293104)]] (duration: 01m 18s) [11:43:15] kostajh: there are some “link suggestion not found for…” errors in logspam-watch, is that what your second backport fixes? [11:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:17] (just out of curiosity) [11:43:18] T293104: TypeError: Argument 1 passed to Wikibase\Repo\Validators\UniquenessViolation::__construct() must implement interface Wikibase\DataModel\Entity\EntityId or be null, string given, called in /srv/mediawiki/php-1.38.0-wmf.3/extensions/Wikibase/repo/includes/Validators/SiteLinkUniquenessValidator.php on line 63 - https://phabricator.wikimedia.org/T293104 [11:43:26] Lucas_WMDE: yes [11:43:29] ok cool :) [11:43:42] still waiting for jenkins on my wmf.4 backport… [11:43:54] at least GrowthExperiments has started running now [11:43:55] 42 minutes :( [11:44:09] looks like it's nearly there [11:45:48] 10Puppet, 10Infrastructure-Foundations, 10Machine-Learning-Team, 10ORES: Write puppet for redis-sentinel - https://phabricator.wikimedia.org/T210580 (10Majavah) >>! In T210580#7423631, @jijiki wrote: > @Ladsgroup is this work still in progress or abandoned? I'll note that I ended up puppetizing redis-sent... 
[11:45:54] (03Merged) 10jenkins-bot: Instantiate ItemId for SiteLinkConflictLookup results [extensions/Wikibase] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730380 (https://phabricator.wikimedia.org/T293104) (owner: 10Lucas Werkmeister (WMDE)) [11:47:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: fix profile.d/helm-config.sh source [puppet] - 10https://gerrit.wikimedia.org/r/730508 (owner: 10Majavah) [11:48:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/Wikibase/repo/: Backport: [[gerrit:730380|Instantiate ItemId for SiteLinkConflictLookup results (T293104)]] (duration: 01m 07s) [11:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:36] urbanecm: go ahead [11:48:41] thanks [11:48:45] (03CR) 10Urbanecm: [C: 03+2] itwiki: Deploy Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730510 (https://phabricator.wikimedia.org/T255037) (owner: 10Urbanecm) [11:48:47] (and let kostajh know when he can deploy the GrowthExperiments backports) [11:48:50] sure [11:48:55] it should be fairly quick [11:49:08] ok [11:49:50] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=itwiki growthexperiments # T255037 [11:49:51] (03Merged) 10jenkins-bot: itwiki: Deploy Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730510 (https://phabricator.wikimedia.org/T255037) (owner: 10Urbanecm) [11:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:57] T255037: Deploy Growth features on Italian Wikipedia - https://phabricator.wikimedia.org/T255037 [11:50:37] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=itwiki --phab='T255037' # T255037 [11:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:59] hmm...this is weird 
https://www.irccloud.com/pastebin/rkS1zHUj/ [11:51:06] kostajh: AFAIK, you were doing something with groups? [11:51:23] I'll debug after deployment, nonexisting config files should not be an issue (not in dark mode at least) [11:54:02] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 38a019d4fd6ff8e7cf92f5e7c6a899c336f20235: itwiki: Deploy Growth features in dark mode (T255037; 1/3) (duration: 01m 05s) [11:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/729970 (owner: 10Jbond) [11:54:24] (03CR) 10Muehlenhoff: [C: 03+2] Sync more content of Hiera contact information and owners.yaml [puppet] - 10https://gerrit.wikimedia.org/r/730482 (owner: 10Muehlenhoff) [11:55:07] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 38a019d4fd6ff8e7cf92f5e7c6a899c336f20235: Deploy Growth features in dark mode (T255037; 2/3) (duration: 01m 04s) [11:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:13] T255037: Deploy Growth features on Italian Wikipedia - https://phabricator.wikimedia.org/T255037 [11:55:49] !log mwscript extensions/Translate/scripts/moveTranslatablePage.php --wiki=mediawikiwiki "Growth/Communities/How to introduce yourself as a mentor" "Growth/Communities/How to configure the mentors' list" "Martin Urbanec (WMF)" --reason '[[:phab:T293184]]' # T293184 [11:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:55] T293184: Rename Growth/Communities/How to introduce yourself as a mentor at mediawiki.org - https://phabricator.wikimedia.org/T293184 [11:56:12] !log urbanecm@deploy1002 Synchronized wmf-config/config/itwiki.yaml: 38a019d4fd6ff8e7cf92f5e7c6a899c336f20235: itwiki: Deploy Growth features in dark mode (T255037) (duration: 01m 04s) [11:56:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:50] kostajh: I'm done with the deployment [11:56:58] feel free to do your stuff (once CI lets you) [11:57:52] (03Merged) 10jenkins-bot: Suggested Edits: Update local config.presets when topics/difficulty presets change [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730371 (https://phabricator.wikimedia.org/T292536) (owner: 10Kosta Harlan) [11:57:55] (03Merged) 10jenkins-bot: Add Link: Do not log "no suggestion found" errors in production log [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730370 (https://phabricator.wikimedia.org/T291251) (owner: 10Gergő Tisza) [11:58:28] jouncebot: nowandnext [11:58:28] For the next 0 hour(s) and 1 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1100) [11:58:28] In 6 hour(s) and 1 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [11:58:28] In 6 hour(s) and 1 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [11:58:45] it’s probably fine if we overrun a bit [11:59:15] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (10MoritzMuehlenhoff) [11:59:21] Ok I’ll go ahead [12:01:01] so I guess I do both patches together? (haven't encountered this situation yet) [12:01:54] oh, they’re both merged now [12:02:00] kostajh: yeah [12:02:06] you can scap pull now and test both together [12:02:12] ok, doing [12:02:15] I’d still sync them separately [12:02:19] I just need to run `scap sync-file` for both [12:02:20] right [12:02:31] you can do one sync for …/GrowthExperiments/modules and one for …/includes [12:05:37] ACKNOWLEDGEMENT - MD RAID on cp4021 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.128.0.121.
Check system logs on 10.128.0.121 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T293225 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:05:41] 10SRE, 10ops-ulsfo: Degraded RAID on cp4021 - https://phabricator.wikimedia.org/T293225 (10ops-monitoring-bot) [12:08:41] (03Abandoned) 10Filippo Giunchedi: wait for config before starting pg [puppet] - 10https://gerrit.wikimedia.org/r/730479 (owner: 10Filippo Giunchedi) [12:09:25] ok, both look good to me, so will sync [12:10:18] 10Puppet, 10Infrastructure-Foundations, 10Machine-Learning-Team, 10ORES: Write puppet for redis-sentinel - https://phabricator.wikimedia.org/T210580 (10Ladsgroup) >>! In T210580#7423631, @jijiki wrote: > @Ladsgroup is this work still in progress or abandoned? definitely abandoned for years. [12:10:59] !log kharlan@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/GrowthExperiments/modules: Backport: [[gerrit:730371|Suggested Edits: Update local config.presets when topics/difficulty presets change (T292536)]] (duration: 01m 07s) [12:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:05] T292536: Task and topic filters not preserved in onboarding - https://phabricator.wikimedia.org/T292536 [12:11:10] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=itwiki --phab='T255037' # after applying 730512 at mwmaint1002 to workaround T293219 # T255037 [12:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:16] T293219: initWikiConfig.php no longer works - https://phabricator.wikimedia.org/T293219 [12:11:17] T255037: Deploy Growth features on Italian Wikipedia - https://phabricator.wikimedia.org/T255037 [12:11:39] (03CR) 10Jbond: [C: 03+2] systemd::sysuser: refactor to provide some useful defaults [puppet] - 10https://gerrit.wikimedia.org/r/729970 (owner: 10Jbond) [12:12:15] !log 
kharlan@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/GrowthExperiments/includes: Backport: [[gerrit:730370|Add Link: Do not log "no suggestion found" errors in production log (T291251)]] (duration: 01m 04s) [12:12:20] (03PS1) 10Jbond: cookbooks sre: update run_scripts to accept a list of scripts not functions [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 [12:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:21] T291251: Exception: Link suggestion not found for "Celeste Bonin" - https://phabricator.wikimedia.org/T291251 [12:12:41] alright, I'm finished [12:13:02] !log UTC morning backport+config window done [12:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:24] (03PS6) 10Jbond: systemd::sysuser: add more error checking [puppet] - 10https://gerrit.wikimedia.org/r/729995 [12:13:26] oh, damn, unless urbanecm wasn’t done yet? I should’ve asked first [12:13:32] i was [12:13:35] ok phew [12:14:05] (with deployment, I was still doing some work at the maint srv, but that doesn't interfere with deployments :)) [12:14:42] so that's why I was still !_log'ing [12:15:11] ok [12:15:12] (03CR) 10jerkins-bot: [V: 04-1] cookbooks sre: update run_scripts to accept a list of scripts not functions [cookbooks] - 10https://gerrit.wikimedia.org/r/730513 (owner: 10Jbond) [12:15:30] (03CR) 10jerkins-bot: [V: 04-1] systemd::sysuser: add more error checking [puppet] - 10https://gerrit.wikimedia.org/r/729995 (owner: 10Jbond) [12:15:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31641/console" [puppet] - 10https://gerrit.wikimedia.org/r/729995 (owner: 10Jbond) [12:16:05] (03PS2) 10Jgiannelos: tile-pregeneration: Add snappy codec support for kafka [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730481 [12:18:19] (03CR) 10Jgiannelos: tile-pregeneration: Add snappy codec support for kafka (031 comment) 
[software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730481 (owner: 10Jgiannelos) [12:20:36] (03PS7) 10Jbond: systemd::sysuser: add more error checking [puppet] - 10https://gerrit.wikimedia.org/r/729995 [12:20:54] (03PS4) 10Jbond: systemd::sysuser: also manage a user resource [puppet] - 10https://gerrit.wikimedia.org/r/730012 [12:21:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31642/console" [puppet] - 10https://gerrit.wikimedia.org/r/730012 (owner: 10Jbond) [12:23:42] (03CR) 10Jbond: [C: 03+2] systemd::sysuser: add more error checking [puppet] - 10https://gerrit.wikimedia.org/r/729995 (owner: 10Jbond) [12:25:50] (03PS1) 10Muehlenhoff: os-tracking: Add initial stub tracking file for buster [puppet] - 10https://gerrit.wikimedia.org/r/730516 [12:29:34] (03CR) 10Muehlenhoff: [C: 03+2] os-tracking: Add initial stub tracking file for buster [puppet] - 10https://gerrit.wikimedia.org/r/730516 (owner: 10Muehlenhoff) [12:32:24] 10Puppet, 10Infrastructure-Foundations, 10Machine-Learning-Team, 10ORES: Write puppet for redis-sentinel - https://phabricator.wikimedia.org/T210580 (10akosiaris) 05Open→03Invalid I 'll close then. This was specifically for the ORES case, T122676, which hasn't happened. @Majavah has been kind enough to... [12:33:26] 10SRE, 10Machine-Learning-Team (Active Tasks), 10Sustainability (Incident Followup), 10User-Ladsgroup: Investigate redis-cluster or other techniques for making Redis not a single point of failure. 
- https://phabricator.wikimedia.org/T181559 (10akosiaris) [12:33:37] (03PS2) 10Jbond: cookbook sre: update SREBatchBase/SREBatchRunnerBase with minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/730506 [12:34:04] (03CR) 10Kormat: [C: 03+1] Revert "db-kill: Fix wmfmariadbpy package dependency" [puppet] - 10https://gerrit.wikimedia.org/r/730487 (owner: 10Jcrespo) [12:34:17] (03PS7) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 [12:40:51] (03CR) 10Muehlenhoff: sre.misc-clusters.thumbor: create batch action cook book for thumbor (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [12:42:52] (03PS8) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 [12:43:10] (03CR) 10Jbond: "fixed thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [12:51:38] (03CR) 10JMeybohm: [C: 04-1] "Thanks! This will need a entry in debian/changelog as well." [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/730509 (owner: 10Majavah) [12:54:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/730012 (owner: 10Jbond) [12:54:33] jouncebot: next [12:54:34] In 5 hour(s) and 5 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [12:54:34] In 5 hour(s) and 5 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [12:54:51] uhhh ok I guess I will deploy my changes tomorrow [12:55:05] (03PS2) 10Majavah: d/control: Remove version from Depends: helm [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/730509 [12:55:22] (03CR) 10Majavah: d/control: Remove version from Depends: helm (031 comment) [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/730509 (owner: 10Majavah) [12:57:36] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10JMeybohm) [12:58:38] (03CR) 10Muehlenhoff: sre.misc-clusters.thumbor: create batch action cook book for thumbor (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [13:03:14] (03CR) 10JMeybohm: [C: 04-1] d/control: Remove version from Depends: helm (031 comment) [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/730509 (owner: 10Majavah) [13:04:28] (03PS3) 10Majavah: d/control: Remove version from Depends: helm [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/730509 [13:05:46] (03PS8) 10Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) [13:05:48] (03PS4) 10Elukey: Set lvs_setup status for the inference service [puppet] - 10https://gerrit.wikimedia.org/r/720009 (https://phabricator.wikimedia.org/T289835) [13:09:57] (03PS1) 10Jbond: apt: add a service descrption for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 [13:09:59] (03PS1) 10Jbond: apt: update apt to use a DNS discovery [dns] - 
10https://gerrit.wikimedia.org/r/730524 [13:10:01] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10lbowmaker) [13:10:52] (03CR) 10jerkins-bot: [V: 04-1] apt: add a service descrption for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [13:11:23] (03CR) 10jerkins-bot: [V: 04-1] apt: update apt to use a DNS discovery [dns] - 10https://gerrit.wikimedia.org/r/730524 (owner: 10Jbond) [13:12:23] (03Abandoned) 10Ladsgroup: jobqueue: Batch jobs that will end up in the default queue [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/727187 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [13:14:02] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Set yrange for percentiles, setting lower limit to 0 [software/benchmw] - 10https://gerrit.wikimedia.org/r/720911 (owner: 10Alexandros Kosiaris) [13:14:28] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Allow calculating multiple percentiles [software/benchmw] - 10https://gerrit.wikimedia.org/r/720912 (owner: 10Alexandros Kosiaris) [13:15:14] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Support HTTPS as well as HTTP [software/benchmw] - 10https://gerrit.wikimedia.org/r/721255 (owner: 10Alexandros Kosiaris) [13:18:29] (03PS2) 10Jbond: apt: add a service descrption for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 [13:19:20] (03CR) 10Elukey: [C: 03+2] custom_deploy.d: change ml-serve's istio gateway https port [deployment-charts] - 10https://gerrit.wikimedia.org/r/730480 (owner: 10Elukey) [13:21:31] (03CR) 10Herron: [C: 03+1] graphite: expire metric files not updated for 3y [puppet] - 10https://gerrit.wikimedia.org/r/730427 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [13:22:05] (03PS2) 10Jbond: apt: update apt to use a DNS discovery [dns] - 10https://gerrit.wikimedia.org/r/730524 [13:22:37] (03CR) 10Herron: [C: 
03+1] graphite: disable tags support [puppet] - 10https://gerrit.wikimedia.org/r/729968 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [13:25:42] (03CR) 10Btullis: [C: 03+2] Increase the maximum renewable lifetime of a Kerberos ticket [puppet] - 10https://gerrit.wikimedia.org/r/727349 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [13:26:22] (03CR) 10JMeybohm: [C: 03+2] d/control: Remove version from Depends: helm [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/730509 (owner: 10Majavah) [13:27:05] (03PS8) 10Herron: warn on idle centrallog mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) [13:30:51] (03Merged) 10jenkins-bot: d/control: Remove version from Depends: helm [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/730509 (owner: 10Majavah) [13:32:42] 10SRE-swift-storage: Media storage metadata inconsistent with Swift - https://phabricator.wikimedia.org/T289996 (10jcrespo) [13:32:54] (03CR) 10Herron: [C: 03+2] warn on idle centrallog mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) (owner: 10Herron) [13:34:10] 10SRE-swift-storage: Media storage metadata inconsistent with Swift - https://phabricator.wikimedia.org/T289996 (10jcrespo) Enwikivoyage was created in 2012 at Wikimedia: https://www.mail-archive.com/newprojects@lists.wikimedia.org/msg00015.html But still references deleted files from 2008. 
[13:34:31] !log ema@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4021.ulsfo.wmnet with OS buster [13:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:37] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-ema: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ema@cumin2002 for host cp4021.ulsfo.wmnet with OS buster completed: - cp... [13:35:14] (03PS8) 10ZPapierski: [WIP] Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) [13:35:53] (03Merged) 10jenkins-bot: warn on idle centrallog mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) (owner: 10Herron) [13:37:10] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . 
[13:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:23] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:38:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Remove T187985 leftover [puppet] - 10https://gerrit.wikimedia.org/r/728468 (owner: 10Alexandros Kosiaris) [13:38:53] !log imported helm-diff_3.1.3-2 to buster-wikimedia (https://gerrit.wikimedia.org/r/c/operations/debs/helm-diff/+/730509) [13:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:09] majavah: ^ [13:39:14] (03CR) 10Muehlenhoff: "The apt servers are not really active/active, the secondary is a cold spare; they underlying package data is synced via rsync and a system" [puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [13:39:29] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove old cruft [software/otrs] - 10https://gerrit.wikimedia.org/r/728595 (owner: 10Alexandros Kosiaris) [13:39:47] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Update package for match Znuny 6.0.37 [software/otrs] - 10https://gerrit.wikimedia.org/r/728478 (owner: 10Alexandros Kosiaris) [13:40:02] thanks! looks good from my side [13:40:23] Great! Thanks for fixing [13:40:34] (03PS6) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [13:41:05] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [13:42:08] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lbowmaker - https://phabricator.wikimedia.org/T293241 (10DAbad) I approve this request for Luke Bowmaker. 
[13:48:21] RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:04] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10CDanis) @MunizaA can you confirm that this wikitech user is you? https://ldap.toolforge.org/user/mnz Also would you rather have mnza0001@gmail.com (from that wikitech account) or muna... [13:50:03] 10SRE, 10Toolhub, 10Traffic, 10User-bd808, 10User-ema: Toolhub API requests with PATCH verbs blocked by CDN - https://phabricator.wikimedia.org/T293157 (10bd808) 05Open→03Resolved a:03bd808 Confirmed to be working, Thank you for the quick attention @ema. [13:55:09] (03CR) 10Hnowlan: [C: 03+1] tile-pregeneration: Add snappy codec support for kafka [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730481 (owner: 10Jgiannelos) [13:56:05] (03CR) 10Nikerabbit: "It looks like you linked a wrong task." 
[puppet] - 10https://gerrit.wikimedia.org/r/728468 (owner: 10Alexandros Kosiaris) [13:56:47] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The following units failed: varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:56] (03PS1) 10Muehlenhoff: package_builder: Add python-sphinx [puppet] - 10https://gerrit.wikimedia.org/r/730529 [13:57:15] (03CR) 10Jgiannelos: [C: 03+2] tile-pregeneration: Add snappy codec support for kafka [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730481 (owner: 10Jgiannelos) [13:58:02] (03CR) 10jerkins-bot: [V: 04-1] package_builder: Add python-sphinx [puppet] - 10https://gerrit.wikimedia.org/r/730529 (owner: 10Muehlenhoff) [13:58:52] (03Merged) 10jenkins-bot: tile-pregeneration: Add snappy codec support for kafka [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730481 (owner: 10Jgiannelos) [13:59:41] !log push prep-work for anycast tuning in ulsfo - T288843 [13:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:50] T288843: Traffic Engineering for Anycast Ranges - https://phabricator.wikimedia.org/T288843 [14:00:30] (03CR) 10Alexandros Kosiaris: otrs: Remove T187985 leftover (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728468 (owner: 10Alexandros Kosiaris) [14:03:33] (03PS9) 10ZPapierski: [WIP] Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) [14:04:58] 10SRE, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on October 9, 2021. - https://phabricator.wikimedia.org/T293251 (10AlexisJazz) [14:05:55] 10SRE, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on October 9, 2021. 
- https://phabricator.wikimedia.org/T293251 (10Majavah) [14:06:17] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [14:06:25] RECOVERY - Check systemd state on wdqs2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:37] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:09:42] (03PS1) 10Jcrespo: mediabackups: Backup jvwikisource media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730533 (https://phabricator.wikimedia.org/T262668) [14:09:59] 10SRE, 10SRE Observability: Develop tooling for quickly parsing 5xx and sampled-1000 logs - https://phabricator.wikimedia.org/T292682 (10colewhite) [14:10:07] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10DAbad) [14:10:26] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup jvwikisource media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730533 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [14:10:26] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [14:10:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:55] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [14:10:55] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:10:59] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:44] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [14:13:44] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:58] (03PS9) 10Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) [14:16:00] (03PS5) 10Elukey: Set lvs_setup status for the inference service [puppet] - 10https://gerrit.wikimedia.org/r/720009 (https://phabricator.wikimedia.org/T289835) [14:20:28] !log temporarily downgrade sphinx packages on deneb to 1.7.9-1~bpo9+1 to build a Ganeti 2.16 stretch backport with delicate toolchain needs [14:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:04] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [14:21:04] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:11] PROBLEM - DPKG on deneb is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:22:35] (03CR) 10Vgutierrez: [C: 03+1] role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [14:23:40] (03CR) 10Vgutierrez: [C: 03+1] Set 
lvs_setup status for the inference service [puppet] - 10https://gerrit.wikimedia.org/r/720009 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [14:24:38] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [14:25:30] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [14:25:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:29] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [14:27:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:19] (03PS1) 10Jcrespo: mediabackups: Backup mgwiktionary media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730534 (https://phabricator.wikimedia.org/T262668) [14:28:35] (03PS1) 10Jgiannelos: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/730536 [14:28:46] (03PS6) 10Elukey: Set lvs_setup status for the inference service [puppet] - 10https://gerrit.wikimedia.org/r/720009 (https://phabricator.wikimedia.org/T289835) [14:29:39] (03PS2) 10Jgiannelos: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/730536 [14:30:01] (03CR) 10Elukey: [C: 03+2] Set lvs_setup status for the inference service [puppet] - 10https://gerrit.wikimedia.org/r/720009 
(https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [14:31:16] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup mgwiktionary media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730534 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [14:31:54] (03CR) 10Lucas Werkmeister (WMDE): "I think it would be good to mention in the commit message that (if I understand correctly) this won’t actually allow URL uploads on any Wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730507 (https://phabricator.wikimedia.org/T293205) (owner: 10Inductiveload) [14:34:46] (03CR) 10Jcrespo: [C: 03+2] Revert "db-kill: Fix wmfmariadbpy package dependency" [puppet] - 10https://gerrit.wikimedia.org/r/730487 (owner: 10Jcrespo) [14:36:00] !log restart pybal on lvs1016 (low-traffic secondary) to pick up new config for inference.discovery.wmnet - T289835 [14:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:06] T289835: Create a LB service for inference.discovery.wmnet - https://phabricator.wikimedia.org/T289835 [14:37:47] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.63:30443]) https://wikitech.wikimedia.org/wiki/PyBal [14:40:59] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 74 connections established with conf1004.eqiad.wmnet:4001 (min=75) https://wikitech.wikimedia.org/wiki/PyBal [14:41:31] the ipvs diff is me, I haven't restarted pybal on 1015 yet [14:41:39] the second alarm no idea [14:41:47] vgutierrez: --^ [14:42:31] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.63:30443]) https://wikitech.wikimedia.org/wiki/PyBal [14:44:17] !log elukey@puppetmaster1001 conftool action : ge; selector: cluster=ml_serve,service=inference [14:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:00] (03CR) 10Jgiannelos: [C: 03+2] 
tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/730536 (owner: 10Jgiannelos) [14:48:31] !log reverted to clean package state on deneb [14:48:35] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:06] all right I forgot to set the nodes as pooled in confctl, lvs1016 looks good now [14:49:58] (03Abandoned) 10Muehlenhoff: package_builder: Add python-sphinx [puppet] - 10https://gerrit.wikimedia.org/r/730529 (owner: 10Muehlenhoff) [14:50:17] !log restart pybal on lvs1015 (low-traffic primary) to pick up new config for inference.discovery.wmnet - T289835 [14:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:24] T289835: Create a LB service for inference.discovery.wmnet - https://phabricator.wikimedia.org/T289835 [14:51:09] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/730536 (owner: 10Jgiannelos) [14:51:18] !log restarting ircecho.service on alert1001 to get back icinga-wm without the underscore [14:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:59] RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:59] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 75 connections established with conf1004.eqiad.wmnet:4001 (min=75) https://wikitech.wikimedia.org/wiki/PyBal [14:52:01] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:52:19] RECOVERY - DPKG on deneb is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:52:43] goood [14:52:47] !log 
repool cp4021, further testing can be performed on sretest1001 T201317 [14:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:53] T201317: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 [14:54:05] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [14:54:05] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:11] (03CR) 10Cwhite: [C: 03+1] graphite: disable tags support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/729968 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [14:56:11] (03CR) 10Cwhite: [C: 03+1] graphite: expire metric files not updated for 3y [puppet] - 10https://gerrit.wikimedia.org/r/730427 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [14:56:31] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [14:56:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:56] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [14:56:56] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:15] (03PS1) 10Elukey: Add dns 
discovery settings for inference.svc.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/730541 (https://phabricator.wikimedia.org/T289835) [14:58:57] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10Ottomata) Okay! [14:59:02] (03PS7) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [14:59:07] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [14:59:07] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [14:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:32] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [14:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:47] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [15:01:28] (03PS8) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [15:01:42] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [15:01:42] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [15:01:45] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[15:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:00] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [15:02:18] (03PS2) 10Inductiveload: Allow copy-upload (by URL) for Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730507 (https://phabricator.wikimedia.org/T293205) [15:02:27] (03CR) 10Inductiveload: Allow copy-upload (by URL) for Wikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730507 (https://phabricator.wikimedia.org/T293205) (owner: 10Inductiveload) [15:03:27] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [15:03:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [15:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:43] (03PS9) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [15:04:07] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[15:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:50] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [15:05:32] (03PS2) 10Ayounsi: Add transit BGP communities for anycast traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) [15:05:34] (03PS2) 10Ayounsi: Configure transit specific outbound BGP communities [homer/public] - 10https://gerrit.wikimedia.org/r/728256 (https://phabricator.wikimedia.org/T288843) [15:06:30] (03CR) 10jerkins-bot: [V: 04-1] Configure transit specific outbound BGP communities [homer/public] - 10https://gerrit.wikimedia.org/r/728256 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [15:06:32] (03CR) 10jerkins-bot: [V: 04-1] Add transit BGP communities for anycast traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [15:06:55] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:09:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [15:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:36] (03PS3) 10Ayounsi: Add transit BGP communities for anycast traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) [15:10:38] (03PS3) 10Ayounsi: Configure transit specific outbound BGP communities [homer/public] - 10https://gerrit.wikimedia.org/r/728256 (https://phabricator.wikimedia.org/T288843) [15:12:19] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [15:12:20] !log jbond@cumin1001 END 
(PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [15:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:41] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10dcaro) @Cmjohnson Hi! Any updates? [15:13:33] !log jbond@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [15:13:33] !log jbond@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [15:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:46] (03PS4) 10Ayounsi: Add transit BGP communities for anycast traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) [15:16:48] (03PS4) 10Ayounsi: Configure transit specific outbound BGP communities [homer/public] - 10https://gerrit.wikimedia.org/r/728256 (https://phabricator.wikimedia.org/T288843) [15:16:50] (03PS10) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [15:16:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [15:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:32] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [15:17:49] (03PS3) 10Jbond: apt: add a service description for apt to allow DNS discovery [puppet] - 
10https://gerrit.wikimedia.org/r/730523 [15:18:09] (03CR) 10jerkins-bot: [V: 04-1] apt: add a service description for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [15:19:11] (03PS4) 10Jbond: apt: add a service description for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 [15:19:32] (03CR) 10Jbond: apt: add a service description for apt to allow DNS discovery (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [15:19:57] (03CR) 10jerkins-bot: [V: 04-1] apt: add a service description for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 (owner: 10Jbond) [15:21:52] (03CR) 10Ayounsi: "See diff between PS1 and PS4" [homer/public] - 10https://gerrit.wikimedia.org/r/728255 (https://phabricator.wikimedia.org/T288843) (owner: 10Ayounsi) [15:25:47] (03PS3) 10Jbond: apt: add a service description for apt to allow DNS discovery [dns] - 10https://gerrit.wikimedia.org/r/730524 [15:26:27] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [15:26:28] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [15:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:06] (03CR) 10Cwhite: [C: 03+2] logstash: dot_expander: better handling of field collisions [puppet] - 10https://gerrit.wikimedia.org/r/728682 (https://phabricator.wikimedia.org/T292099) (owner: 10Cwhite) [15:28:18] (03PS4) 10Jbond: apt: add a service description for apt to allow DNS discovery [dns] - 10https://gerrit.wikimedia.org/r/730524 [15:28:59] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [15:28:59] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) 
for 1 hosts: sretest1001.eqiad.wmnet [15:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:10] (03CR) 10Vgutierrez: "PCC seems happy: https://puppet-compiler.wmflabs.org/compiler1001/31645/" [puppet] - 10https://gerrit.wikimedia.org/r/730016 (https://phabricator.wikimedia.org/T292619) (owner: 10Vgutierrez) [15:29:28] (03PS1) 10Jcrespo: mediabackups: Backup shwiki media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730550 (https://phabricator.wikimedia.org/T262668) [15:29:56] (03PS5) 10Jbond: apt: add a service description for apt to allow DNS discovery [puppet] - 10https://gerrit.wikimedia.org/r/730523 [15:31:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd::sysuser: also manage a user resource [puppet] - 10https://gerrit.wikimedia.org/r/730012 (owner: 10Jbond) [15:32:01] (03PS2) 10Jcrespo: mediabackups: Backup shwiki media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730550 (https://phabricator.wikimedia.org/T262668) [15:32:42] (03PS3) 10Giuseppe Lavagetto: kubernetes::deployment_server: add general data for mcrouter + nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/722833 (https://phabricator.wikimedia.org/T291530) [15:32:47] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup shwiki media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730550 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [15:34:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] kubernetes::deployment_server: add general data for mcrouter + nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/722833 (https://phabricator.wikimedia.org/T291530) (owner: 10Giuseppe Lavagetto) [15:35:36] (03PS7) 10Jbond: gitlab::ssh explicitly add git user with fixed id [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [15:36:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31646/console" [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [15:36:41] (03PS6) 10Giuseppe Lavagetto: mediawiki: Add rsyslog sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/725892 (https://phabricator.wikimedia.org/T288851) [15:36:43] (03PS1) 10Giuseppe Lavagetto: mediawiki: introduce the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730557 (https://phabricator.wikimedia.org/T291530) [15:36:47] (03PS1) 10Giuseppe Lavagetto: toolhub: use the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730558 (https://phabricator.wikimedia.org/T291530) [15:36:49] (03PS1) 10Giuseppe Lavagetto: changeprop/api-gateway: use the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730559 (https://phabricator.wikimedia.org/T291530) [15:38:17] (03CR) 10Wolfgang Kandek: "I really like the simplification this script introduces." [puppet] - 10https://gerrit.wikimedia.org/r/726857 (owner: 10Jcrespo) [15:38:20] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10MunizaA) >>! In T292955#7424405, @CDanis wrote: > @MunizaA can you confirm that this wikitech user is you? https://ldap.toolforge.org/user/mnz > > Also would you rather have mnza0001@... 
[15:39:05] (03CR) 10Jbond: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [15:40:06] (03PS1) 10Jcrespo: mediabackups: Backup srwiki media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730560 (https://phabricator.wikimedia.org/T262668) [15:40:20] (03PS2) 10Jcrespo: mediabackups: Backup srwiki media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730560 (https://phabricator.wikimedia.org/T262668) [15:43:21] (03PS7) 10Giuseppe Lavagetto: mediawiki: Add rsyslog sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/725892 (https://phabricator.wikimedia.org/T288851) [15:43:23] (03PS2) 10Giuseppe Lavagetto: mediawiki: introduce the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730557 (https://phabricator.wikimedia.org/T291530) [15:43:25] (03PS2) 10Giuseppe Lavagetto: toolhub: use the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730558 (https://phabricator.wikimedia.org/T291530) [15:43:27] (03PS2) 10Giuseppe Lavagetto: changeprop/api-gateway: use the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730559 (https://phabricator.wikimedia.org/T291530) [15:53:38] (03PS11) 10Arturo Borrero Gonzalez: cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) [15:54:12] (03CR) 10jerkins-bot: [V: 04-1] cloudbackup: deploy cinder-backup service [puppet] - 10https://gerrit.wikimedia.org/r/728400 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [15:56:01] (03CR) 10Cwhite: [C: 03+2] bump changelog 1.7.0-5 [software/ecs] - 10https://gerrit.wikimedia.org/r/726226 (owner: 10Cwhite) [15:57:13] (03Merged) 10jenkins-bot: bump changelog 1.7.0-5 [software/ecs] - 10https://gerrit.wikimedia.org/r/726226 (owner: 10Cwhite) [16:07:33] (03PS1) 10Cwhite: logstash: deploy ecs 
patch 5 [puppet] - 10https://gerrit.wikimedia.org/r/730588 [16:12:41] (03PS1) 10Urbanecm: initWikiConfig: Fix loading difficulty/group from SUGGESTED_EDITS_TASK_TYPES [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730567 (https://phabricator.wikimedia.org/T293219) [16:12:59] (03PS1) 10Urbanecm: initWikiConfig: Fix loading difficulty/group from SUGGESTED_EDITS_TASK_TYPES [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730568 (https://phabricator.wikimedia.org/T293219) [16:13:03] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10CDanis) [16:14:22] (03PS1) 10CDanis: mnz: shell & analytics-privatedata-users w/krb [puppet] - 10https://gerrit.wikimedia.org/r/730589 (https://phabricator.wikimedia.org/T292955) [16:15:29] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup srwiki media on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/730560 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:23:24] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [16:23:24] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [16:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:58] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:25:08] (03PS9) 10Ryan Kemper: elasticsearch: begin pull out spicerack config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) [16:31:34] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: begin pull out spicerack config [software/spicerack] - 
10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: 10Ryan Kemper) [16:32:28] jouncebot: nowandnext [16:32:28] No deployments scheduled for the next 1 hour(s) and 27 minute(s) [16:32:28] In 1 hour(s) and 27 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [16:32:29] In 1 hour(s) and 27 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [16:32:40] (03CR) 10Urbanecm: [C: 03+2] initWikiConfig: Fix loading difficulty/group from SUGGESTED_EDITS_TASK_TYPES [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730567 (https://phabricator.wikimedia.org/T293219) (owner: 10Urbanecm) [16:32:44] (03CR) 10Urbanecm: [C: 03+2] initWikiConfig: Fix loading difficulty/group from SUGGESTED_EDITS_TASK_TYPES [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730568 (https://phabricator.wikimedia.org/T293219) (owner: 10Urbanecm) [16:36:38] (03PS1) 10JMeybohm: istio: Add wmf-certificates proxyv2 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/730591 [16:37:44] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [16:37:44] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [16:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:12] (03CR) 10Dzahn: "between wikiweb and wikikube I vote for wikikube for reasons stated above by Jelto, the scope seems wider than just web with the equivalen" [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [16:47:53] (03PS1) 10Dzahn: Revert "miscweb: set staging version to production to debug issue pulling from 
registry" [deployment-charts] - 10https://gerrit.wikimedia.org/r/730570 [16:48:00] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 2795 MB (3% inode=84%): /tmp 2795 MB (3% inode=84%): /var/tmp 2795 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [16:49:22] (03PS1) 10Ssingh: wikidough: update monitoring::service to check the host IP [puppet] - 10https://gerrit.wikimedia.org/r/730593 [16:50:16] !log stat1008 - apt-get clean - freed 1.3 GB disk space - was alerting in Icinga because / was 97% full [16:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:59] (03PS1) 10Zabe: Update logo for liwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730594 (https://phabricator.wikimedia.org/T291479) [16:51:58] (03Merged) 10jenkins-bot: initWikiConfig: Fix loading difficulty/group from SUGGESTED_EDITS_TASK_TYPES [extensions/GrowthExperiments] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730567 (https://phabricator.wikimedia.org/T293219) (owner: 10Urbanecm) [16:52:19] (03PS2) 10Ssingh: wikidough: update monitoring::service to check the host IP [puppet] - 10https://gerrit.wikimedia.org/r/730593 [16:54:01] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31647/console" [puppet] - 10https://gerrit.wikimedia.org/r/730593 (owner: 10Ssingh) [16:55:47] (03PS1) 10Volans: client: split connect and read timeout [software/debmonitor] - 10https://gerrit.wikimedia.org/r/730597 [16:57:02] !log stat1008 - short on disk space, mostly used in /tmp, high CPU usage by R proccess, sent a message about it to all shell users via wall [16:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:38] (03PS1) 10Bartosz Dziewoński: Add NS_MAIN back to 
wgExtraSignatureNamespaces for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730598 (https://phabricator.wikimedia.org/T291630) [16:57:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/730597 (owner: 10Volans) [16:58:32] (03PS1) 10Zabe: Remove an old dawiki temporaray logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730599 [16:59:20] (03Merged) 10jenkins-bot: initWikiConfig: Fix loading difficulty/group from SUGGESTED_EDITS_TASK_TYPES [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730568 (https://phabricator.wikimedia.org/T293219) (owner: 10Urbanecm) [17:00:19] (03CR) 10Volans: [C: 03+2] client: split connect and read timeout [software/debmonitor] - 10https://gerrit.wikimedia.org/r/730597 (owner: 10Volans) [17:00:51] to release can I just send a patch for the debian branch? [17:00:59] * volans wrong tab [17:02:53] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [17:02:53] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [17:03:07] (03Merged) 10jenkins-bot: client: split connect and read timeout [software/debmonitor] - 10https://gerrit.wikimedia.org/r/730597 (owner: 10Volans) [17:03:12] here [17:03:18] 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, and 4 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (1032X)... [17:03:37] cr2-eqiad [17:03:37] I'm here [17:04:11] (do we already have a task for why the device name is in the alert annotations but not the IRC message?) 
[17:04:14] it's the codfw-eiad link [17:04:21] codfw-eqiad transport link [17:04:33] https://librenms.wikimedia.org/device/device=2/tab=port/port=11592/ [17:04:38] (03CR) 10Dzahn: [C: 03+1] gitlab::ssh explicitly add git user with fixed id [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [17:05:12] rzl: my backports merged a while ago (just noticed), and then a page came -- is it a good idea to do deployment, or should i let you investigate and then deploy? [17:05:26] oh, I had a missed call but not a text, thought it was just one of the scams [17:05:41] urbanecm: hold off for just a moment please, if it's not too inconvenient [17:05:46] not at all [17:05:51] * jbond here [17:06:33] please lmk once it's safe to proceed [17:06:42] ack will do [17:07:02] is it possible this is related to the swift rebalancing that is ongoing in the last few days? [17:07:23] joe: ahhh [17:07:26] yes [17:07:41] traffic is going down under the threshold [17:07:53] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [17:07:53] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [17:08:18] ok, I'm going afk now :) [17:08:30] also related to telia being down [17:09:12] ftr: checked maint-announce calendar. has codfw entries but just annual UPS & battery maintenance and chiller maintenance [17:09:19] godog: still around? [17:10:03] mutante: thanks, down since 7am today [17:10:06] UTC [17:10:31] urbanecm: you should be fine to go ahead [17:10:40] XioNoX: hi [17:10:43] +1 [17:11:01] rzl: thanks [17:11:02] reading now, but no swift rebalances don't cross sites [17:11:19] godog: ok, noted! 
[17:11:26] swift rebalances for ms-* hosts that is, thanos-be* hosts do but those are not rebalancing [17:11:29] godog: only replication does, but I thought that might be affected by rebalances [17:11:30] to be precise [17:11:39] there is an increase of cross DC traffic any idea what could cause it? [17:11:49] sry late to the party somehow got disconnected from this channel earlier it seems [17:12:03] joe: it isn't afaik [17:12:17] I opened https://phabricator.wikimedia.org/T293091 yesterday, which is the longer term fix [17:12:33] I'm going to open a Telia ticket for the link down [17:12:54] and the 3rd point is the overall increase of codfw-eqiad traffic [17:12:56] XioNoX: can you identify from librenms where is the increase in traffic originating from? [17:13:20] (the answer to my question before is, we do already have T273716) [17:13:21] T273716: Improve Alertmanager/LibreNMS notifications - https://phabricator.wikimedia.org/T273716 [17:13:34] joe: no, it's not sharp enough, internal netflow is what we need (and there is a task for it :) ) [17:13:47] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/GrowthExperiments/maintenance/initWikiConfig.php: 5c27154cf434bebc37f5e98e2ad1b5cea7cde1d4: initWikiConfig: Fix loading difficulty/group from SUGGESTED_EDITS_TASK_TYPES (T293219) (duration: 01m 15s) [17:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:53] T293219: initWikiConfig.php no longer works - https://phabricator.wikimedia.org/T293219 [17:14:03] did I read the alert right that the traffic increase was codfw -> eqiad ? 
[17:14:29] network utilization within eqiad and codfw on the "datacenter-overview" grafana board looks fairly normal, no obvious spike there [17:14:44] godog: yep, from codfw to eqiad [17:14:51] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/GrowthExperiments/maintenance/initWikiConfig.php: dd7a3314602ffddc5b917cccc71c917301639388: initWikiConfig: Fix loading difficulty/group from SUGGESTED_EDITS_TASK_TYPES (T293219) (duration: 01m 04s) [17:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:28] https://librenms.wikimedia.org/bill/bill_id=24/ the monthly view doesn't show any large increase, so I might have been put off by the redundant link down [17:15:30] topranks: ack, thanks [17:15:35] * urbanecm done [17:16:28] but yeah looks like the spike now was just about to tip utilization over the threshold [17:16:31] https://librenms.wikimedia.org/graphs/to=1634145300/id=11592/type=port_bits/from=1634058900/ [17:16:41] is this under the threshold to create a doc? (no users affected). "just Telia"? [17:16:57] this morning got pretty close too afaics [17:17:16] yeah [17:18:34] I'm sure we're slightly constrained, but it's not "red lining" either. [17:18:35] so assumption would be it's not noticeably affecting users. [17:18:41] +1 [17:18:52] just enough above the threshold to be annoying [17:19:28] cool, aka good enough [17:19:44] yeah looks like a few datapoints [17:20:49] XioNoX: aha "Telia Carrier Ref: 01344862 Disturbance Information to Wikimedia Foundation Inc." [17:21:09] it is in maint-announce inbox but not on calendar yet, sent 10 h ago [17:21:25] "your circuit is affected by major outage incident in Atlanta due to cable fault , our team are currently investigating this outage " [17:21:37] mutante: good catch [17:21:50] "testing will begin shortly" 3 min ago! [17:21:51] "Please be informed our provider technician has arrived on site at 180 Peachtree St.
The technician is standing by while we wait for access to be approved. Once approved testing will begin. We will keep you updated with our finding. Thank you for your patience during this time." 52min ago [17:21:53] ah! [17:22:23] "One of our provider technician has gained access" [17:22:38] so nothing is really actionable [17:22:48] we could raise the threshold temporarily [17:23:03] imagines they got the lockpicking lawyer [17:24:05] XioNoX: SGTM [17:24:48] mutante: at least it's not a bridge on fire like a recent Lumen fibercut [17:25:23] lol [17:25:23] alternatively the delay XioNoX so there's a little more leeway than 5m [17:25:35] XioNoX: haha, yea, or thieves stealing a couple hundred feet of cable thinking it has copper [17:25:57] godog: the issue is that it's also one of our DDoS signals [17:27:12] XioNoX: that's true, yeah threshold it is [17:27:58] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn Telia working on cable fault in Atlanta https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:27:58] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn Telia working on cable fault in Atlanta https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:29:01] (03PS1) 10Volans: hosts: prevent already exists error [software/debmonitor] - 10https://gerrit.wikimedia.org/r/730604 [17:29:31] (03CR) 10Krinkle: [C: 03+1] static.php: Add support for /static/current rewrites (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730182 (https://phabricator.wikimedia.org/T285232) (owner: 10Giuseppe Lavagetto) [17:29:51] (03PS1) 10AOkoth: gitlab: add gitlab restore systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/730605 (https://phabricator.wikimedia.org/T285867) 
[17:30:08] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 2433 MB (3% inode=84%): /tmp 2433 MB (3% inode=84%): /var/tmp 2433 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [17:31:17] 10SRE, 10ops-eqiad, 10Analytics-Clusters: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (10BTullis) 05Open→03Resolved [17:31:29] (03PS2) 10AOkoth: gitlab: add gitlab restore systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/730605 (https://phabricator.wikimedia.org/T285867) [17:31:58] ok going afk, talk to you tomorrow [17:34:29] 10SRE, 10Analytics-Clusters: stat1008 - low on disk - https://phabricator.wikimedia.org/T293283 (10Dzahn) [17:34:51] (03CR) 10Jbond: [C: 03+1] "LGTM lets give it a go" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/730604 (owner: 10Volans) [17:35:10] 10SRE, 10Analytics-Clusters: stat1008 - low on disk - https://phabricator.wikimedia.org/T293283 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=stat1008&service=Disk+space [17:37:33] (03CR) 10Dzahn: "This looks good. Do you want logging and monitoring for this? logging is per default on, monitoring is per default off. 
they are both para" [puppet] - 10https://gerrit.wikimedia.org/r/730605 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [17:39:18] (03CR) 10Volans: [C: 03+2] hosts: prevent already exists error [software/debmonitor] - 10https://gerrit.wikimedia.org/r/730604 (owner: 10Volans) [17:43:42] (03CR) 10Dzahn: [V: 03+1] "In https://puppet-compiler.wmflabs.org/compiler1001/31649/ you can see how restore_ensure is present on 2001 but absent on 1001, and this " [puppet] - 10https://gerrit.wikimedia.org/r/730605 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [17:44:10] (03Merged) 10jenkins-bot: hosts: prevent already exists error [software/debmonitor] - 10https://gerrit.wikimedia.org/r/730604 (owner: 10Volans) [17:44:18] (03CR) 10Dzahn: [V: 03+1] "click through to "change catalog" for both hosts in the compiler output and search in that page for "restore_ensure"" [puppet] - 10https://gerrit.wikimedia.org/r/730605 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [17:52:42] (03CR) 10Herron: [C: 03+1] logstash: kafka input: add manage_truststore parameter [puppet] - 10https://gerrit.wikimedia.org/r/727625 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [17:53:56] XioNoX: "Please note that if your service is a protected service, you should not experience any issues as your service is running on a protected path." when they say this do they mean physical protection? [17:54:57] mutante: "optical" protection equipment that automatically re-route the lasers to a backup path [17:55:28] but it usually costs twice the price, so better get explicit redundancy [17:55:35] (03PS1) 10Volans: Upstream release v0.3.1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/730607 [17:55:36] XioNoX: ACK, gotcha. 
so https://en.wikipedia.org/wiki/Path_protection [17:56:45] eh, didn't know that had it's own wikipedia page :) [17:57:07] I wasn't sure if they mean that or like a steel cage when you think of cable cuts, hah [17:57:34] (03CR) 10Volans: [C: 03+2] Upstream release v0.3.1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/730607 (owner: 10Volans) [17:58:58] jouncebot: next [17:58:58] In 0 hour(s) and 1 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [17:58:58] In 0 hour(s) and 1 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [17:59:02] 10SRE, 10Analytics, 10Analytics-Kanban: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10CDanis) Just a quick update that, after a year, this is still happening, on both [[ https://logstash.wikimedia.org/goto/a35e4125b... [17:59:08] is the deployment happening as normal, or is there some issue ongoing? [17:59:13] 10SRE, 10Analytics-Clusters: stat1008 - low on disk - https://phabricator.wikimedia.org/T293283 (10Ottomata) Oh just saw this ticket. Have removed a lot of files in tmp. sudo find /tmp -mtime +10 -size +20M | xargs sudo rm -rfv [17:59:15] the backport window* [17:59:48] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:59:51] 10SRE, 10Analytics-Clusters: stat1008 - low on disk - https://phabricator.wikimedia.org/T293283 (10GoranSMilovanovic) @Dzahn > Most of the space is used by /tmp and there are always users running CPU-intensive processes (R, java, others?) here. I frequently run R jobs from stat1008 (every hour, from crontab)... 
[18:00:02] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:00:04] dancy and brennen: #bothumor I � Unicode. All rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800). [18:00:04] RoanKattouw and Urbanecm: May I have your attention please! UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [18:00:05] stanglavine, zabe, and MatmaRex: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:19] o/ [18:00:24] MatmaRex: still happening [18:00:25] (03Merged) 10jenkins-bot: Upstream release v0.3.1 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/730607 (owner: 10Volans) [18:00:34] I'm here :) [18:00:39] There was a page but I believe Martin was told things were safe [18:00:50] I can deploy today! [18:01:47] 10SRE, 10Analytics-Clusters: stat1008 - low on disk - https://phabricator.wikimedia.org/T293283 (10Dzahn) @GoranSMilovanovic Moving files out of your home should not matter since they are already on a separate and large partition mounted on /srv. This is just about / and the /tmp inside it. I may have misint... 
[18:02:28] RhinosF1: yea, that's right [18:02:50] also: RECOVERY - Router interfaces on cr1-eqiad is OK: [18:02:56] it seems to be fixed at Telia [18:03:18] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [18:03:36] Thanks mutante [18:03:41] 10SRE, 10Analytics-Clusters: stat1008 - low on disk - https://phabricator.wikimedia.org/T293283 (10GoranSMilovanovic) @Dzahn > I may have misinterpreted the CPU usage but looked like an R process was using 100% at that time. Ok, I am getting in touch because I typically use R. I always constrain the number... [18:03:49] 10SRE, 10Analytics-Clusters: stat1008 - low on disk - https://phabricator.wikimedia.org/T293283 (10Dzahn) 05Open→03Resolved a:03Dzahn rescheduled Icinga check: <+icinga-wm> RECOVERY - Disk space on stat1008 is OK: DISK OK [18:04:05] stanglavine: why are you renaming the group to autoextendedconfirmed? [18:04:29] because the "extendedconfirmed" will be the "normal" group [18:04:29] (03CR) 10Urbanecm: [C: 03+2] Update logo for liwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730594 (https://phabricator.wikimedia.org/T291479) (owner: 10Zabe) [18:04:32] (not implicit) [18:04:40] urbanecm: [18:04:55] stanglavine: does that mean that you'll have _two_ groups instead of just one? [18:04:59] 10SRE, 10Analytics-Clusters: stat1008 - low on disk - https://phabricator.wikimedia.org/T293283 (10Dzahn) a:05Dzahn→03Ottomata [18:05:07] urbanecm: yes [18:05:13] (03Merged) 10jenkins-bot: Update logo for liwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730594 (https://phabricator.wikimedia.org/T291479) (owner: 10Zabe) [18:05:23] urbanecm: I'll put the change from https://phabricator.wikimedia.org/T290691, hope it's available for that [18:05:53] XioNoX: officially over. "services are restored. 
RFO will be shared once all details are gathered" [18:06:39] zabe: your patch is at mwdebug1001, please test. [18:07:06] (03CR) 10Dzahn: [C: 03+2] Revert "miscweb: set staging version to production to debug issue pulling from registry" [deployment-charts] - 10https://gerrit.wikimedia.org/r/730570 (owner: 10Dzahn) [18:07:51] mutante: I can confirm link is up, thanks for the update [18:08:06] (03PS2) 10Urbanecm: Remove an old dawiki temporary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730599 (owner: 10Zabe) [18:08:12] !log uploaded debmonitor-client_0.3.1 to apt.wikimedia.org stretch-wikimedia,buster-wikimedia,bullseye-wikimedia [18:08:12] (03PS3) 10Urbanecm: Remove an old dawiki temporary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730599 (owner: 10Zabe) [18:08:19] XioNoX: yep, Icinga agrees [18:08:20] (03CR) 10Urbanecm: [C: 03+2] Remove an old dawiki temporary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730599 (owner: 10Zabe) [18:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:00] (03PS2) 10Urbanecm: Add NS_MAIN back to wgExtraSignatureNamespaces for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730598 (https://phabricator.wikimedia.org/T291630) (owner: 10Bartosz Dziewoński) [18:09:07] (03CR) 10Urbanecm: [C: 03+2] Add NS_MAIN back to wgExtraSignatureNamespaces for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730598 (https://phabricator.wikimedia.org/T291630) (owner: 10Bartosz Dziewoński) [18:09:13] urbanecm: lgtm [18:09:18] thanks, syncing [18:09:23] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [18:09:23] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [18:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:10] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [18:10:11] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [18:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:46] (03Merged) 10jenkins-bot: Remove an old dawiki temporary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730599 (owner: 10Zabe) [18:10:50] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 8 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" - https://phabricator.wikimedia.org/T204026 (10Damilare) @Nikerabbit it's should be in the deploy... [18:10:54] (03Merged) 10jenkins-bot: Add NS_MAIN back to wgExtraSignatureNamespaces for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730598 (https://phabricator.wikimedia.org/T291630) (owner: 10Bartosz Dziewoński) [18:10:55] zabe: your patch is live [18:11:03] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 1b96f54a518620b0dc6a0ab63b402d0ea2c6bf70: Update logo for liwiktionary (T291479) (duration: 01m 14s) [18:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:08] T291479: Update logo li.wikt - https://phabricator.wikimedia.org/T291479 [18:11:27] zabe: MatmaRex: your patches are at mwdebug1001, please test [18:11:50] looking [18:12:08] (03Merged) 10jenkins-bot: Revert "miscweb: set staging version to production to debug issue pulling from registry" [deployment-charts] - 10https://gerrit.wikimedia.org/r/730570 (owner: 10Dzahn) [18:12:20] urbanecm: seems good [18:12:26] my patch only removes a long unused logo. 
dawiki doesn't break, so I think we are good [18:12:29] syncing [18:12:37] zabe: ack, syncing too [18:12:38] !log volans@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest1001.eqiad.wmnet [18:12:38] !log volans@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest1001.eqiad.wmnet [18:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:04] (CR) Jelto: gitlab::ssh explicitly add git user with fixed id (1 comment) [puppet] - https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: Jelto) [18:14:59] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 224e2a374b1cc6327e9d8c2bca576091ce4efc74: Add NS_MAIN back to wgExtraSignatureNamespaces for mediawikiwiki (T291630) (duration: 01m 05s) [18:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:05] T291630: Main namespace should not be listed in $wgExtraSignatureNamespaces on most wikis (Commons, MediaWiki.org) - https://phabricator.wikimedia.org/T291630 [18:15:20] thanks [18:15:29] MatmaRex: np [18:16:29] !log urbanecm@deploy1002 Synchronized static/images/project-logos: 694bc234ab5dbb9a2387a6129998d45a53ac0ab3: Remove an old dawiki temporary logo (duration: 01m 04s) [18:16:29] stanglavine: you need to create local group pages describing the group, like https://pt.wikipedia.org/wiki/MediaWiki:Group-autoextendedconfirmed [18:16:32] can you do that please?
[18:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:52] zabe: the other patch live too [18:16:54] (03CR) 10CDanis: [C: 03+2] mnz: shell & analytics-privatedata-users w/krb [puppet] - 10https://gerrit.wikimedia.org/r/730589 (https://phabricator.wikimedia.org/T292955) (owner: 10CDanis) [18:17:08] (03PS4) 10Juan90264: Create Translation namespace for viwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730573 (https://phabricator.wikimedia.org/T290691) [18:17:09] urbanecm: sure, only with the group name? [18:17:50] stanglavine: whatever you want the group to display as [18:17:55] thanks [18:17:57] the human name :)) [18:18:17] urbanecm: ok, done [18:19:11] Hello [18:19:25] I'll put the change right away [18:19:44] urbanecm: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/730573 [18:20:08] Juan_90264: please update the calendar [18:20:12] i'll look at it in the end [18:21:10] Done, updated [18:21:12] (03CR) 10Jelto: [C: 03+1] "looks good to me, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/730605 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [18:21:21] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (10CDanis) 05Open→03Resolved Shell account created and it will be live in ~30 minutes. You should be able to connect to `bast1003.wikimedia.org` immediately thou... [18:22:08] stanglavine: can you also do https://pt.wikipedia.org/wiki/MediaWiki:Group-autoextendedconfirmed-member? 
[18:22:24] (https://pt.wikipedia.org/wiki/MediaWiki:Group-extendedconfirmed-member is an example) [18:23:09] urbanecm: done too [18:23:13] thanks [18:23:18] (03PS3) 10Urbanecm: Set autoconfirmedextended and confirmedextended for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728774 (https://phabricator.wikimedia.org/T292915) (owner: 10Rafael) [18:23:51] stanglavine: and last but not least, https://pt.wikipedia.org/wiki/MediaWiki:Grouppage-autoextendedconfirmed too [18:23:55] (cf. https://pt.wikipedia.org/wiki/MediaWiki:Grouppage-extendedconfirmed) [18:24:05] sorry for not listing them all, took me a while to find them all [18:24:12] * urbanecm just remembers there are three of them [18:24:22] (03CR) 10Urbanecm: [C: 03+2] Set autoconfirmedextended and confirmedextended for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728774 (https://phabricator.wikimedia.org/T292915) (owner: 10Rafael) [18:24:32] pulling into the debug server in the meantime [18:24:52] done too, 3 pages created [18:24:55] thanks [18:25:45] (03Merged) 10jenkins-bot: Set autoconfirmedextended and confirmedextended for ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728774 (https://phabricator.wikimedia.org/T292915) (owner: 10Rafael) [18:25:48] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:26:40] Waiting... [18:27:00] jouncebot: now [18:27:00] For the next 0 hour(s) and 32 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [18:27:01] For the next 0 hour(s) and 32 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1800) [18:27:09] (03CR) 10Urbanecm: [C: 04-1] "I don't see a namespace with 114 ID there? 
You're creating something as an alias to the namespace, but you don't actually define the names" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730573 (https://phabricator.wikimedia.org/T290691) (owner: 10Juan90264) [18:27:20] mutante: do you need to deploy anything? :)) [18:27:56] stanglavine: the patch is at mwdebug1001, please test [18:27:57] urbanecm: no :) not in the scap sense, maybe in k8s sense [18:28:02] ack :) [18:29:55] urbanecm: seems ok to me [18:30:05] thanks, syncing [18:31:22] urbanecm: (maybe you are doing this) but we need clear the old "normal" group [18:31:27] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0bb2b388217aa91a39ed3684f87fdf7edb06fd81: Set autoconfirmedextended and confirmedextended for ptwiki (T292915) (duration: 01m 04s) [18:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:34] T292915: Adjustment on extended confirmed user group on ptwiki - https://phabricator.wikimedia.org/T292915 [18:31:35] clear = empty [18:31:44] I'll solve the problem now, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/730573 [18:31:59] may can you do this by server side? [18:32:14] stanglavine: yeah, i will do it [18:32:22] arnoldokoth: hey, do you want to deploy your change together? 
[18:32:24] ok, thank you :) [18:32:54] https://pt.wikipedia.org/wiki/Especial:Privil%C3%A9gios/stanglavine says Implicit member of: Autoconfirmed users, Autoconfirmados estendidos, i guess that's what we expect [18:32:56] purging the group [18:33:17] urbanecm: yes [18:33:49] !log [urbanecm@mwmaint1002 ~]$ mwscript emptyUserGroup.php --wiki=ptwiki extendedconfirmed [18:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:11] (03PS5) 10Urbanecm: add extendedconfimed for autoreview group on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728767 (https://phabricator.wikimedia.org/T292912) (owner: 10Rafael) [18:34:27] stanglavine: Removing users from extendedconfirmed...done! Removed 1497 users in total. [18:34:29] Ty urbanecm [18:34:41] https://pt.wikipedia.org/w/index.php?title=Especial:Lista_de_utilizadores/extendedconfirmed is also empty [18:34:49] (03CR) 10Urbanecm: [C: 03+2] add extendedconfimed for autoreview group on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728767 (https://phabricator.wikimedia.org/T292912) (owner: 10Rafael) [18:34:50] urbanecm: thanks! [18:35:17] (03CR) 10Ssingh: [V: 03+1 C: 03+2] wikidough: update monitoring::service to check the host IP [puppet] - 10https://gerrit.wikimedia.org/r/730593 (owner: 10Ssingh) [18:35:41] mutante: Yeah, sure. [18:35:59] any time stanglavine [18:36:05] (03Merged) 10jenkins-bot: add extendedconfimed for autoreview group on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/728767 (https://phabricator.wikimedia.org/T292912) (owner: 10Rafael) [18:36:11] Juan_90264: did you see my CR for your patch? [18:36:29] stanglavine: your other patch is at mwdebug1001, please have a look [18:36:35] looking [18:36:37] urbanecm: it may be worth initSiteStats as that shows the old count for extended confirmed [18:36:37] I saw it and I'm correcting it right now [18:36:38] arnoldokoth: cool! 
I'll merge it in gerrit first [18:36:48] RhinosF1: right, good idea [18:36:54] (03CR) 10Dzahn: [V: 03+1 C: 03+2] gitlab: add gitlab restore systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/730605 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [18:37:04] !log [urbanecm@mwmaint1002 ~]$ mwscript initSiteStats.php --wiki=ptwiki --update [18:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:31] urbanecm: the other patch seems good too [18:37:35] stanglavine: thanks, syncing [18:38:19] (03PS5) 10Juan90264: Create Translation namespace for viwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730573 (https://phabricator.wikimedia.org/T290691) [18:38:45] arnoldokoth: get ready to run puppet on both 1001 and 2001, ok? [18:39:00] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 06fd0f225575448771cdba0d4e6bf36bb6715bc1: add extendedconfimed for autoreview group on ptwiki (T292912) (duration: 01m 04s) [18:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:05] T292912: Add "extendedconfirmed" right to autoreview usergroup on ptwiki - https://phabricator.wikimedia.org/T292912 [18:39:06] (03CR) 10Juan90264: Create Translation namespace for viwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730573 (https://phabricator.wikimedia.org/T290691) (owner: 10Juan90264) [18:39:12] stanglavine: and, live [18:39:22] urbanecm: https://pt.wikipedia.org/wiki/Especial:Estat%C3%ADsticas still shows 1497 [18:39:23] urbanecm: Adjusted! [18:39:24] 10SRE, 10Analytics-Clusters: stat1008 - low on disk - https://phabricator.wikimedia.org/T293283 (10BTullis) Maybe we could encourage users to set the configuration item `spark.local.dir` to be `~/tmp/` so that any large files are generated under `/srv/`? {F34687508} @Ottomata - Do you think that this would hel... 
[18:39:25] (CR) jerkins-bot: [V: -1] Create Translation namespace for viwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/730573 (https://phabricator.wikimedia.org/T290691) (owner: Juan90264) [18:39:30] RhinosF1: it's still running [18:39:33] many thanks urbanecm! [18:39:34] Ah! [18:39:40] calculating edits, articles, etc. takes a while [18:39:46] Heh [18:39:48] these are my first (and second) patches :) [18:39:50] stanglavine: any time. Anything else I can do for you today? [18:39:53] congratulations then :) [18:39:59] * RhinosF1 is too used to logging happening at the end [18:40:04] mutante: Ready. [18:40:17] stanglavine: welcome to the club of people who've had their patches deployed [18:40:27] and who did not break Wikipedia (yet) [18:40:37] haha thanks RhinosF1 :) [18:40:40] arnoldokoth: alright, merging on puppet master.. and ..done.. please run puppet agent on both machines and observe what happens [18:40:53] and thanks urbanecm for everything, this is all :) [18:40:58] no problem [18:41:00] we are expecting not much on 1001 but the new timer and service and all that on 2001 [18:41:16] then see Jelto's comment on Gerrit [18:41:50] systemctl list-timers and systemctl status backup-restore [18:42:12] urbanecm: wikipedia is one thing I haven't yet broken! [18:42:36] urbanecm: I moved well outside that club last week, it's surprisingly easy [18:42:39] stanglavine: please join us more often [18:43:15] indeed majavah [18:43:27] Hello Stanglavine, hope the group is implied [18:43:45] mutante: No change on primary. New timer installed on replica. [18:44:29] RhinosF1: when possible, always :) [18:44:42] Juan_90264: the new group is implied yes [18:44:56] init script finished, but number not updated. Likely memcached or something. [18:45:05] Juan_90264: are you still working on fixing your patch? :)) [18:45:17] RhinosF1: Perfect! [18:45:18] urbanecm: very likely, there's far too many cached [18:45:22] Caches* [18:45:23] arnoldokoth: :)) great.
So now you have 2 new units. a timer (systemctl status backup-restore.timer) which is in status "active (waiting)" and says "5h 20min left". looks good, see that? [18:45:55] urbanecm: Fixed already [18:46:11] Juan_90264: jenkins voted -1 though, or am i missing something? [18:47:35] RhinosF1: purged that memcached key manually, special page should be updated too [18:47:38] arnoldokoth: and the other unit is the service that the timer will start (systemctl status backup-restore.service). this is in status "inactive (dead)" because it has never run yet, timer is waiting and hasn't told it to run yet. Let's try to manually start the service to make sure that works as well. do: systemctl start backup-restore and then systemctl status backup-restore [18:47:44] again and it should now not be dead anymore and instead show the command it runs [18:47:51] Juan_90264: you need to use underscores not spaces in namespace names [18:48:19] urbanecm: yep! [18:48:41] Thanks RhinosF1 [18:49:46] urbanecm: could you purge the liwiktionary logo? https://en.wikipedia.org/static/images/project-logos/liwiktionary.png still shows the old one. [18:50:03] thanks for the reminder [18:50:04] sure [18:51:01] zabe: done, can you test? [18:51:59] (PS6) Juan90264: Create Translation namespace for viwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/730573 (https://phabricator.wikimedia.org/T290691) [18:52:00] works now [18:52:03] thx [18:52:12] (PS1) Ssingh: anycast_monitoring: add checks for Wikidough DoH/DoT [puppet] - https://gerrit.wikimedia.org/r/730619 [18:52:23] mutante: I think I'll have to adjust the timer a little bit. Checking 1001 and the time difference between the backup and the restore jobs is rather short. Once it's done running I'll change that.
[18:53:14] (PS7) Urbanecm: Create Translation namespace for viwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/730573 (https://phabricator.wikimedia.org/T290691) (owner: Juan90264) [18:53:19] (CR) Urbanecm: [C: +2] Create Translation namespace for viwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/730573 (https://phabricator.wikimedia.org/T290691) (owner: Juan90264) [18:53:28] urbanecm: Placed underscores [18:53:51] thanks, let's test [18:55:24] arnoldokoth: I can confirm things look good (nothing changed on 1001, no puppet errors, new units exist on 2001, new service can be started and is running apparently). all looking good. adjustment of the time is of course no problem. by the way, you don't have to use "OnCalendar" and say "always at X o'clock". You can also say just "make sure it happens daily (at any time, but never longer [18:55:30] than 24 hours after the last time)" or "never later than X hours after it finished the last time" [18:57:45] (Merged) jenkins-bot: Create Translation namespace for viwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/730573 (https://phabricator.wikimedia.org/T290691) (owner: Juan90264) [18:58:47] Juan_90264: pulled to mwdebug1001, can you test? [18:58:57] (CR) Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/31650/alert1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/730619 (owner: Ssingh) [19:00:04] dancy and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1900). [19:00:17] urbanecm: I tested it, and LGTM [19:00:28] dancy: brennen: please wait for a second [19:00:33] finishing last scap sync-file [19:00:35] OK. Ping me when you're done.
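[editor's note] The scheduling options mutante describes at 18:55 could be sketched as a systemd timer unit roughly like the one below. The unit names backup-restore.timer and backup-restore.service come from the log; the description text, the intervals, and the choice of directives are assumptions for illustration, not the contents of the Puppet change under review.

```ini
# backup-restore.timer (illustrative sketch only; values are assumptions)
[Unit]
Description=Trigger backup-restore.service periodically

[Timer]
Unit=backup-restore.service

# Fixed wall-clock schedule ("always at X o'clock"):
# OnCalendar=*-*-* 02:00:00

# "Daily, but never longer than 24 hours after the last time":
# measured from when the service was last started.
# OnUnitActiveSec=24h

# "Never later than X hours after it finished the last time":
# measured from when the service last became inactive.
OnUnitInactiveSec=24h

# The two relative triggers above need a first run to measure from,
# so a boot-relative trigger is commonly added alongside them.
OnBootSec=15min

[Install]
WantedBy=timers.target
```

With a relative trigger, the gap between the backup job and the restore job no longer depends on two wall-clock times lining up, which may be relevant to the short gap arnoldokoth mentions at 18:52.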
[19:00:36] Juan_90264: syncing [19:01:43] Okay [19:02:00] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 87879865c35edab3ead523027681146e00d6fc02: Create Translation namespace for viwikisource (T290691) (duration: 01m 04s) [19:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:05] Juan_90264: should be live [19:02:06] T290691: Create Translation namespace for vi.wikisource - https://phabricator.wikimedia.org/T290691 [19:02:17] namespaceDupes.php has nothing to fix [19:02:27] dancy: brennen: the floor is yours! [19:02:33] thx [19:02:33] thanks urbanecm [19:02:55] (03PS1) 10Ahmon Dancy: group1 wikis to 1.38.0-wmf.4 refs T281168 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730620 [19:02:57] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.38.0-wmf.4 refs T281168 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730620 (owner: 10Ahmon Dancy) [19:03:33] 10ops-drmrs, 10DNS, 10Infrastructure-Foundations, 10netops: setup drmrs mgmt prefix/range - https://phabricator.wikimedia.org/T293294 (10RobH) p:05Triage→03High [19:03:47] 10ops-drmrs, 10DNS, 10Infrastructure-Foundations, 10netops: setup drmrs mgmt prefix/range - https://phabricator.wikimedia.org/T293294 (10RobH) [19:04:16] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.4 refs T281168 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730620 (owner: 10Ahmon Dancy) [19:05:28] 10ops-drmrs, 10DNS, 10Infrastructure-Foundations, 10netops: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (10RobH) [19:05:47] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.4 refs T281168 [19:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:53] T281168: 1.38.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T281168 [19:06:04] 10ops-drmrs, 10DNS, 10Infrastructure-Foundations, 10netops: setup drmrs mgmt & 
private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (RobH) I chatted with @cmooney about this in IRC and we cannot see if there is a set pattern to which ranges are used for mgmt, private1,... [19:06:50] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.4 refs T281168 (duration: 01m 03s) [19:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:27] !log [gitlab2001:~] $ sudo /usr/bin/gitlab-ctl start gitlab-workhorse [19:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:56] !log gitlab2001 - started workhorse which was for some reason marked as down after restore command ran [19:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:01] ops-drmrs, DNS, Infrastructure-Foundations, netops: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (cmooney) Thanks Rob. Yeah not 100% sure what we should allocate. I'm thinking 10.136.0.0/16 for the site seems logical, with 10.136.12... [19:15:43] Namespaces are working, thanks urbanecm! [19:15:49] np [19:16:00] !log gitlab2001 - status before was that "gitlab-ctl status" showed components "gitlab-workhorse" and "postgres-exporter" as "down". this was either pre-broken or caused by the restore process. after manually 'gitlab-ctl start gitlab-workhorse' all of the components are in "run" and https://gitlab-replica.wikimedia.org is up (T285867) [19:16:04] ops-drmrs, DNS, Infrastructure-Foundations, netops: setup drmrs mgmt & private prefixs - question on switch status - https://phabricator.wikimedia.org/T293294 (BBlack) CC @MMandere as well once we have a decision on the IP prefixes here!
[19:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:07] T285867: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 [19:18:31] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [19:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:46] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:22:49] (03PS1) 10Jgiannelos: tegola-vector-tiles: Enable debugging for sql queries [deployment-charts] - 10https://gerrit.wikimedia.org/r/730626 [19:25:06] 10SRE, 10Analytics-Clusters: stat1008 - low on disk - https://phabricator.wikimedia.org/T293283 (10Dzahn) It would definitely help if those tmp files could be under /srv/tmp instead of /tmp. That should fix it if the config change isn't a problem. Nice [19:33:25] 10SRE, 10LDAP: Improve LDAP logging - https://phabricator.wikimedia.org/T214489 (10Htriedman) [19:36:52] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Enable debugging for sql queries [deployment-charts] - 10https://gerrit.wikimedia.org/r/730626 (owner: 10Jgiannelos) [19:42:20] (03Merged) 10jenkins-bot: tegola-vector-tiles: Enable debugging for sql queries [deployment-charts] - 10https://gerrit.wikimedia.org/r/730626 (owner: 10Jgiannelos) [19:43:18] (03CR) 10Dzahn: "# Disable Prompts" [puppet] - 10https://gerrit.wikimedia.org/r/730605 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [19:44:58] (03PS1) 10Jbond: systemd::sysuser: also manage the group if we have a uid:gid id [puppet] - 10https://gerrit.wikimedia.org/r/730627 [19:45:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Jclark-ctr) [19:45:30] (03CR) 10Dzahn: "19:08 mutante: [gitlab2001:~] $ sudo /usr/bin/gitlab-ctl 
start gitlab-workhorse" [puppet] - 10https://gerrit.wikimedia.org/r/730605 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [19:48:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Jclark-ctr) kubernetes1021 is up and bios configured [19:49:40] (03CR) 10Jbond: [V: 03+1 C: 03+1] "thanks for the additional testing see inline" [puppet] - 10https://gerrit.wikimedia.org/r/728380 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [19:50:29] (03CR) 10BryanDavis: [C: 03+2] toolhub: use the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730558 (https://phabricator.wikimedia.org/T291530) (owner: 10Giuseppe Lavagetto) [19:57:28] (03PS3) 10BryanDavis: toolhub: use the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730558 (https://phabricator.wikimedia.org/T291530) (owner: 10Giuseppe Lavagetto) [19:57:42] (03CR) 10BryanDavis: toolhub: use the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730558 (https://phabricator.wikimedia.org/T291530) (owner: 10Giuseppe Lavagetto) [19:58:35] (03PS2) 10BryanDavis: toolhub: Get mcrouter image tags from upstream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/723618 (https://phabricator.wikimedia.org/T291530) [20:00:04] dancy and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T1900). [20:00:05] chrisalbon and accraze: That opportune time is upon us again. Time for a Services – Graphoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T2000). 
[20:01:16] (03PS1) 10AOkoth: gitlab: remove unnecessary comments from restore script [puppet] - 10https://gerrit.wikimedia.org/r/730630 (https://phabricator.wikimedia.org/T285867) [20:01:19] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1021.eqiad.wmnet with OS stretch [20:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host kubernetes1021.eqiad.wmnet with OS stretch [20:03:59] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kubernetes1021.eqiad.wmnet with OS stretch [20:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host kubernetes1021.eqiad.wmnet with OS stretch executed with errors: -... 
[20:04:38] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1021.eqiad.wmnet with OS stretch [20:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host kubernetes1021.eqiad.wmnet with OS stretch [20:05:22] (03CR) 10BryanDavis: [C: 03+2] toolhub: use the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730558 (https://phabricator.wikimedia.org/T291530) (owner: 10Giuseppe Lavagetto) [20:06:58] (03PS1) 10Dzahn: gitlab: deactivate new backup-restore unit for now [puppet] - 10https://gerrit.wikimedia.org/r/730632 (https://phabricator.wikimedia.org/T285867) [20:09:17] (03Merged) 10jenkins-bot: toolhub: use the common_images data structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/730558 (https://phabricator.wikimedia.org/T291530) (owner: 10Giuseppe Lavagetto) [20:11:01] (03CR) 10BryanDavis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/723618 (https://phabricator.wikimedia.org/T291530) (owner: 10BryanDavis) [20:14:34] (03PS1) 10MSantos: tegola-vector-tiles: Adjust maxzoom to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/730634 [20:19:30] (03CR) 10Brennen Bearnes: [C: 03+1] gitlab: deactivate new backup-restore unit for now [puppet] - 10https://gerrit.wikimedia.org/r/730632 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [20:19:52] (03PS2) 10Dzahn: gitlab: deactivate new backup-restore unit for now [puppet] - 10https://gerrit.wikimedia.org/r/730632 (https://phabricator.wikimedia.org/T285867) [20:20:59] (03CR) 10BryanDavis: [C: 03+2] toolhub: Get mcrouter image tags from upstream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/723618 
(https://phabricator.wikimedia.org/T291530) (owner: 10BryanDavis) [20:21:39] (03PS2) 10MSantos: tegola-vector-tiles: Adjust maxzoom to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/730634 [20:25:24] (03Merged) 10jenkins-bot: toolhub: Get mcrouter image tags from upstream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/723618 (https://phabricator.wikimedia.org/T291530) (owner: 10BryanDavis) [20:26:01] (03PS1) 10BryanDavis: toolhub: Bump container version to 2021-10-13-195718-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730635 (https://phabricator.wikimedia.org/T293159) [20:26:13] (03PS3) 10Dzahn: gitlab: deactivate new backup-restore unit for now [puppet] - 10https://gerrit.wikimedia.org/r/730632 (https://phabricator.wikimedia.org/T285867) [20:26:47] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Adjust maxzoom to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/730634 (owner: 10MSantos) [20:27:40] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1021.eqiad.wmnet with OS stretch [20:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host kubernetes1021.eqiad.wmnet with OS stretch completed: - kubernetes1... [20:27:53] 10SRE, 10LDAP-Access-Requests: Grant Access to (some Superset dashboards) for - https://phabricator.wikimedia.org/T292575 (10Etonkovidova) >>! In T292575#7412331, @Kormat wrote: > Hi @etonkovidova, your existing shell account was deactivated, so i've reinstated it now. You're already in th... 
[20:29:25] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/31651/gitlab2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/730632 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [20:30:08] (03PS1) 10Zabe: Fall back to main page if given title is invalid [extensions/Collection] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730574 (https://phabricator.wikimedia.org/T293299) [20:30:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10RobH) 05Open→03Resolved a:05Cmjohnson→03RobH [20:30:32] (03Merged) 10jenkins-bot: tegola-vector-tiles: Adjust maxzoom to 15 [deployment-charts] - 10https://gerrit.wikimedia.org/r/730634 (owner: 10MSantos) [20:31:35] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2021-10-13-195718-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730635 (https://phabricator.wikimedia.org/T293159) (owner: 10BryanDavis) [20:31:50] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [20:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:30] dancy: wanna backport the fix for T293299? I can test it. 
[20:32:30] T293299: TypeError: Argument 1 passed to CollectionHooks::getBookCreatorBoxContent() must be an instance of Title, null given, called in /srv/mediawiki/php-1.38.0-wmf.4/extensions/Collection/includes/Api/ApiGetBookCreatorBoxContent.php on line 39 - https://phabricator.wikimedia.org/T293299 [20:33:03] (03CR) 10Dzahn: [C: 03+2] gitlab: remove unnecessary comments from restore script [puppet] - 10https://gerrit.wikimedia.org/r/730630 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [20:35:54] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2021-10-13-195718-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730635 (https://phabricator.wikimedia.org/T293159) (owner: 10BryanDavis) [20:36:31] zabe: dancy may be afk, but i can go ahead and sling that one out. [20:36:44] cool [20:37:32] (03CR) 10Brennen Bearnes: [C: 03+2] Fall back to main page if given title is invalid [extensions/Collection] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730574 (https://phabricator.wikimedia.org/T293299) (owner: 10Zabe) [20:40:30] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[20:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:04] (03Merged) 10jenkins-bot: Fall back to main page if given title is invalid [extensions/Collection] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730574 (https://phabricator.wikimedia.org/T293299) (owner: 10Zabe) [20:45:02] zabe: should be on mwdebug1002 [20:45:58] works, https://mh.wikipedia.org/w/api.php?action=collection&submodule=getbookcreatorboxcontent&pagename=%3C is no longer throwing a type error [20:46:21] (03PS1) 10MSantos: tegola-vector-tiles: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/730642 [20:46:23] (03PS1) 10Dzahn: gitlab: allow installing the restore script while NOT enabling the timer [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) [20:46:28] brennen: ^ [20:46:50] !log bd808@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . [20:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:30] zabe: thanks - syncing. 
[20:48:17] (03CR) 10jerkins-bot: [V: 04-1] gitlab: allow installing the restore script while NOT enabling the timer [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [20:48:45] (03PS2) 10Dzahn: gitlab: allow installing the restore script while NOT enabling the timer [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) [20:48:59] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/Collection/includes/Api/ApiGetBookCreatorBoxContent.php: Backport: [[gerrit:730574|Fall back to main page if given title is invalid (T293299)]] (duration: 01m 04s) [20:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:05] T293299: TypeError: Argument 1 passed to CollectionHooks::getBookCreatorBoxContent() must be an instance of Title, null given, called in /srv/mediawiki/php-1.38.0-wmf.4/extensions/Collection/includes/Api/ApiGetBookCreatorBoxContent.php on line 39 - https://phabricator.wikimedia.org/T293299 [20:50:35] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . [20:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:35] (03CR) 10jerkins-bot: [V: 04-1] gitlab: allow installing the restore script while NOT enabling the timer [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [20:52:40] zabe: Thanks for fixing that. I appreciate, do you want to also look at this - https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Collection/+/730563 while we're at it? Thanks! 
[20:54:39] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/730642 (owner: 10MSantos) [20:54:43] (03PS3) 10Dzahn: gitlab: allow installing the restore script while NOT enabling the timer [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) [20:55:10] Sure I can a look at it, but since I don't have +2, I can't give the final approval. [20:57:57] (03CR) 10Dzahn: [V: 04-1] "not yet since it touches gitlab1001 which it shouldn't - https://puppet-compiler.wmflabs.org/compiler1002/31652/" [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [20:58:44] Thanks for the fixes! [20:59:08] (03Merged) 10jenkins-bot: tegola-vector-tiles: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/730642 (owner: 10MSantos) [21:00:17] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10Jclark-ctr) 05Open→03Resolved Drive Arrived today Replaced [21:00:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10Jclark-ctr) 05Open→03Resolved Drive Arrived today Replaced [21:00:57] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [21:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:33] !log removing 2 files for legal compliance [21:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:44] (03PS4) 10Dzahn: gitlab: allow installing the restore script while NOT enabling the timer [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) [21:05:33] zabe: Okay no problem. I'll poke duesen to merge when he has time. Thank you very much. 
[21:05:55] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10Eevans) 05Resolved→03Open a:05Jclark-ctr→03hnowlan The arrays are still degraded (the device state is removed); I think there is still more yet to be done [21:07:23] zabe: xSavitar: +2'ed, code looks sane, was tested and is covered. [21:07:37] urbanecm: Thank you v much! [21:07:41] \o/ [21:08:14] xSavitar: did you take a look at T293300? [21:08:15] T293300: PHP Notice: Undefined index: timestamp - https://phabricator.wikimedia.org/T293300 [21:08:29] any time xSavitar [21:09:20] zabe: Looking now... [21:10:28] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/31653/" [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [21:12:21] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:15:15] (03PS5) 10Dzahn: gitlab: allow installing the restore script while NOT enabling the timer [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) [21:17:14] zabe: The migration from ajax to api is really exposing some stuff in that code base. [21:17:29] I'll have to spend some time and improve the code itself. All the issues are in the underlying code. [21:17:51] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/31654/" [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [21:18:09] sure, justed wanted to make sure you are aware of that task :) [21:18:14] * just [21:19:23] zabe: Does https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Collection/+/730645 make sense to you? 
[21:19:33] Not sure of a better way to fix this one :) [21:19:44] Apart from just checking we don't offset on timestamp [21:19:53] (03CR) 10AOkoth: [C: 03+1] gitlab: allow installing the restore script while NOT enabling the timer [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [21:21:01] (03CR) 10Dzahn: "I know this is a bit ugly but it DOES fix the issue. puppet installs the script on 2001 but does NOT enable the timer and 1001 is still un" [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [21:24:49] (03CR) 10Dzahn: "fyi Jelto and Brennen, this way Arnold is unblocked and can further debug the restore script. without it either puppet would remove the fi" [puppet] - 10https://gerrit.wikimedia.org/r/730641 (https://phabricator.wikimedia.org/T285867) (owner: 10Dzahn) [21:26:24] The fix makes sence sence to me, but I don't really understand why the failure is new in wmf.4. [21:27:31] (03CR) 10Dzahn: admin/otrs: create new root admin group vrts-admins, add Arnold (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/728648 (owner: 10Dzahn) [21:27:49] (03PS1) 10Zabe: Api: Avoid trying to access undefined offset in a user's collection [extensions/Collection] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730577 (https://phabricator.wikimedia.org/T293261) [21:28:05] (03PS3) 10Dzahn: admin/otrs: create new root admin group vrts-roots, add Arnold [puppet] - 10https://gerrit.wikimedia.org/r/728648 [21:28:39] (03PS4) 10Dzahn: admin/otrs: create new root admin group vrts-roots, add Arnold [puppet] - 10https://gerrit.wikimedia.org/r/728648 [21:28:47] zabe: Okay! I wonder too [21:28:56] (03CR) 10Dzahn: [C: 03+2] admin/otrs: create new root admin group vrts-roots, add Arnold [puppet] - 10https://gerrit.wikimedia.org/r/728648 (owner: 10Dzahn) [21:29:06] urbanecm: Do you want to deploy zabe's backport if a window is going on? 
[21:29:28] (03PS1) 10Zabe: Api: Avoid trying to access undefined offset in a user's collection [extensions/Collection] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730578 (https://phabricator.wikimedia.org/T293261) [21:29:31] this one: https://gerrit.wikimedia.org/r/730577 please [21:29:52] oh now, backports o_o [21:29:58] xSavitar: I'll go to bed now, I won't be able to make the next window. [21:30:14] Oh okay no problem. Please get some sleep. [21:30:23] zabe: Maybe we schedule something for the next window? [21:30:42] (03CR) 10Dzahn: "thanks Joanna, deploying" [puppet] - 10https://gerrit.wikimedia.org/r/728648 (owner: 10Dzahn) [21:30:59] (03CR) 10Dzahn: "Majavah: good catch, fixed" [puppet] - 10https://gerrit.wikimedia.org/r/728648 (owner: 10Dzahn) [21:31:31] (03CR) 10Dzahn: "Arnold" [puppet] - 10https://gerrit.wikimedia.org/r/728648 (owner: 10Dzahn) [21:32:04] (03CR) 10Dzahn: "@Alexandros: [otrs1001:~] $ id aokoth - uid=33489(aokoth) gid=500(wikidev) groups=500(wikidev),707(sre-admins),834(vrts-roots)" [puppet] - 10https://gerrit.wikimedia.org/r/728648 (owner: 10Dzahn) [21:32:05] yes [21:32:17] (03CR) 10jerkins-bot: [V: 04-1] Api: Avoid trying to access undefined offset in a user's collection [extensions/Collection] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730578 (https://phabricator.wikimedia.org/T293261) (owner: 10Zabe) [21:32:22] which window is up next so we can schedule? [21:33:11] jouncebot next [21:33:11] In 1 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T2300) [21:33:29] joucebot now [21:33:37] jouncebot now [21:33:38] No deployments scheduled for the next 1 hour(s) and 26 minute(s) [21:35:28] If you have something ready to fix a train blocker before then, let me know. [21:36:09] dancy: thank you! 
[21:36:13] jouncebot: now [21:36:14] No deployments scheduled for the next 1 hour(s) and 23 minute(s) [21:36:19] dancy: were talking about T293261, which is technically not a train blocker, but quite a lot of logspam [21:36:20] T293261: PHP Notice: Undefined offset: 2 - https://phabricator.wikimedia.org/T293261 [21:36:41] zabe: why does the patch on wmf.3? Any idea? [21:36:57] (03PS2) 10Zabe: Api: Avoid trying to access undefined offset in a user's collection [extensions/Collection] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730578 (https://phabricator.wikimedia.org/T293261) [21:37:11] zabe: thx: I'll be very happy to have that one cleaned up. [21:37:52] dancy: And this too, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Collection/+/730645 (T293300) [21:37:52] T293300: PHP Notice: Undefined index: timestamp - https://phabricator.wikimedia.org/T293300 [21:38:22] xSavitar: I backported to wmf.3 since according to the task the error is showing up in wmf.3 [21:38:37] okay [21:39:19] but the tests are failing since the api modules you moved are 'new' in wmf.4. I just removed them for the backport. Should be fine imo. [21:39:41] dancy: can we start with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Collection/+/730577 [21:40:44] sure. Is it testable? [21:42:53] (03CR) 10Ahmon Dancy: [C: 03+2] Api: Avoid trying to access undefined offset in a user's collection [extensions/Collection] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730577 (https://phabricator.wikimedia.org/T293261) (owner: 10Zabe) [21:43:02] I think only in a way that there is no error showing up in the logs. [21:43:19] (I at least don't know any other way) [21:43:34] zabe: you're right, that's the only way. [21:44:07] Trying to edit an out of range item will just return now [21:44:07] Alrighty. [21:44:33] ooh, I see units in the commit. That is nice. 
[21:44:39] *unit tests [21:44:59] dancy: :) [21:46:08] (03Merged) 10jenkins-bot: Api: Avoid trying to access undefined offset in a user's collection [extensions/Collection] (wmf/1.38.0-wmf.4) - 10https://gerrit.wikimedia.org/r/730577 (https://phabricator.wikimedia.org/T293261) (owner: 10Zabe) [21:46:32] Alright, about to let 'er reip [21:46:34] *rip [21:46:43] Apologies for bad typing this afternoon [21:47:42] !log removing 8 files for legal compliance [21:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:15] !log dancy@deploy1002 Synchronized php-1.38.0-wmf.4/extensions/Collection: Backport: [[gerrit:730577|Api: Avoid trying to access undefined offset in a user's collection (T293261)]] (duration: 01m 04s) [21:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:21] T293261: PHP Notice: Undefined offset: 2 - https://phabricator.wikimedia.org/T293261 [21:52:45] btw does this fix cover both `/w/api.php?action=collection-renamechapter&chaptername=Renamed&index=2 PHP Notice: Undefined offset: 2` and `/w/api.php?action=collection&submodule=renamechapter&index=5&chaptername=chaptername PHP Notice: Undefined offset: 5` ? [21:54:51] It should (the second looks like my testing example). [21:55:13] Could we also backport to wmf.3: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Collection/+/730578 [21:55:35] ok [21:55:49] (03CR) 10Ahmon Dancy: [C: 03+2] Api: Avoid trying to access undefined offset in a user's collection [extensions/Collection] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730578 (https://phabricator.wikimedia.org/T293261) (owner: 10Zabe) [21:58:14] dancy: yes, like zabe said. 
It should check any index out of range when trying to rename a chapter in a user's collection book :) [21:59:25] (03Merged) 10jenkins-bot: Api: Avoid trying to access undefined offset in a user's collection [extensions/Collection] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/730578 (https://phabricator.wikimedia.org/T293261) (owner: 10Zabe) [21:59:25] thx both. [22:01:16] !log dancy@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/Collection/includes/Specials/SpecialCollection.php: Backport: [[gerrit:730578|Api: Avoid trying to access undefined offset in a user's collection (T293261)]] (duration: 01m 04s) [22:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:22] T293261: PHP Notice: Undefined offset: 2 - https://phabricator.wikimedia.org/T293261 [22:13:19] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:30:13] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [22:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:25] (03PS1) 10Dzahn: gzip the test index.html [container/miscweb] - 10https://gerrit.wikimedia.org/r/730653 [22:48:51] (03CR) 10Dzahn: [C: 03+2] gzip the test index.html [container/miscweb] - 10https://gerrit.wikimedia.org/r/730653 (owner: 10Dzahn) [22:51:42] (03Merged) 10jenkins-bot: gzip the test index.html [container/miscweb] - 10https://gerrit.wikimedia.org/r/730653 (owner: 10Dzahn) [22:52:35] (03CR) 10Cwhite: [C: 03+2] logstash: move kubernetes_docker parsing to priority 15 [puppet] - 10https://gerrit.wikimedia.org/r/728683 (https://phabricator.wikimedia.org/T292099) (owner: 10Cwhite) [23:00:05] RoanKattouw and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211013T2300). [23:00:05] No Gerrit patches in the queue for this window AFAICS. [23:10:18] urbanecm: Hello? Online? [23:12:54] Any developer available? [23:13:13] "UTC late backport window" [23:14:03] hi Juan_90264, there might not be any deployers available :( [23:15:05] MatmaRex: Sad, I believe I'll make it tomorrow [23:18:47] Strange that I'm seeing several green-lighted imprementadors on https://web.libera.chat [23:21:56] tgr_ and James_F: ? [23:23:24] doing [23:23:47] Hello is available? [23:27:09] (03PS1) 10BryanDavis: toolhub: Bump container version to 2021-10-13-231209-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730662 (https://phabricator.wikimedia.org/T293103) [23:27:25] (03CR) 10Gergő Tisza: [C: 03+2] Create an alias for the project namespace on kswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730575 (https://phabricator.wikimedia.org/T291740) (owner: 10Juan90264) [23:28:13] (03Merged) 10jenkins-bot: Create an alias for the project namespace on kswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730575 (https://phabricator.wikimedia.org/T291740) (owner: 10Juan90264) [23:28:34] Great, thanks tgr! 
[23:31:13] (03PS1) 10Dzahn: miscweb: bump version to 2021-10-13-225516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730663 [23:31:54] (03CR) 10Dzahn: [C: 03+2] miscweb: bump version to 2021-10-13-225516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730663 (owner: 10Dzahn) [23:36:05] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:730575|Create an alias for the project namespace on kswiki (T291740)]] (duration: 01m 05s) [23:36:09] (03Merged) 10jenkins-bot: miscweb: bump version to 2021-10-13-225516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/730663 (owner: 10Dzahn) [23:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:11] T291740: Creation of namespace aliases at Ks Wikipedia - https://phabricator.wikimedia.org/T291740 [23:36:32] Juan_90264: done [23:37:26] Perfect, thanks tgr|away! [23:37:32] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [23:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log