[00:31:27] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[00:33:47] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[01:28:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:36:45] (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:57] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:50] Hey guys. Can I get some clarification on something?
[02:08:06] I got this while trying to diagnose erratic IABot behavior
[02:08:07] If you report this error to the Wikimedia System Administrators, please include the details below.
Request from 185.15.56.22 via cp1085 cp1085, Varnish XID 1062733635
Upstream caches: cp1085 int
Error: 429, Too Many Requests at Mon, 19 Sep 2022 02:02:55 GMT
[02:08:41] I know IABot can be a bit heavy on I/O sometimes, but it shouldn't be THAT heavy to trigger this.
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:42] Cyberpower678: rate limits are often dynamic, changing based on whatever is going on. I would recommend that your bot just wait for the length of time specified in the `retry-after` header and then retry
[02:19:07] legoktm: It would be nice to know if I can get an idea of just how much the bot is pushing the production servers though. It shouldn't ever be hitting it hard enough to warrant a 429. Except maybe, when it initializes and tries to import a bunch of template metadata.
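Editor's note: the advice above (wait for the time given in `retry-after`, then retry) could be sketched roughly as follows. This is a minimal illustration in Python, not IABot's actual code. The `send` callable, the function names, and the 5-second fallback are all assumptions made for the example. The header may carry either a number of seconds or an HTTP date, so the sketch handles both.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=5.0):
    """Turn a Retry-After header value into a number of seconds to wait.

    The 5-second default for a missing or unparsable value is an assumption.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Delay given directly in seconds, e.g. "120".
        return float(value)
    try:
        # Delay given as an HTTP date, e.g. "Mon, 19 Sep 2022 02:03:55 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def call_with_429_handling(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is a hypothetical callable returning (status_code, headers_dict).
    """
    for _ in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        sleep(retry_after_seconds(headers.get("Retry-After")))
    return 429
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting; a real client would also need case-insensitive header lookup, which a plain dict does not provide.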
[02:19:13] PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:19:37] I can't answer that part, but I think the 429 is the sign to slow down
[02:20:29] It's a little harder than that. The bot is a global bot running on 140+ ish wikis concurrently. Each wiki runs on a different process, and they don't really communicate with each other.
[02:20:44] (This is something that will be addressed in the IABot rewrite being planned)
[02:21:39] So another process won't know that production is rate limiting the bot, and will keep retrying.
[02:22:31] But either way, legoktm do you think you could poke an op to maybe feed me some request logs originating from an IABot UA?
[02:22:55] rate limits are global
[02:23:05] Well that's unfortunate.
[02:23:06] which is why building in 429 handling is important
[02:23:15] I thought it was local
[02:23:33] This also presents somewhat of a scaling issue.
[02:23:33] MediaWiki rate limits are per-wiki (though some are global!), but these are enforced by the caching layer
[02:23:54] if you need logs, it would be best to file a task under SRE in Phab
[02:24:07] Link?
[02:24:23] https://phabricator.wikimedia.org/project/view/1025/
[02:24:41] * Cyberpower678 notes that the new version will have much more extensive logging built in.
[02:25:03] Thank you
[02:27:48] :)
[02:32:36] SRE, InternetArchiveBot: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678)
[02:32:45] SRE, InternetArchiveBot: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) p:Triage→High
[02:36:01] legoktm: ^
[02:36:40] ok, it'll get triaged by whoever is on clinic duty this week
[02:36:47] Let's hope that there's just some big inefficiency going on that can easily be dealt with. Otherwise, the bot's going to be down for a bit while I work to implement 429 handling
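Editor's note: the conversation above describes per-wiki bot processes that do not communicate, while the rate limit applies to all of them together. One way such processes could share a "pause until" time is through a small file that every process checks before sending a request. The sketch below is only an illustration under that assumption. The file-based approach, the function names, and the path are hypothetical and are not taken from IABot.

```python
import os
import tempfile
import time


def read_pause_until(path):
    """Return the shared 'do not send before' Unix time, or 0.0 if none is set."""
    try:
        with open(path) as handle:
            return float(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return 0.0


def record_pause(path, seconds, now=None):
    """Publish a pause of `seconds` for every process that reads `path`.

    An existing pause is never shortened. Two processes writing at the same
    moment can still race, which this sketch accepts.
    """
    now = time.time() if now is None else now
    until = max(now + seconds, read_pause_until(path))
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as handle:
        handle.write(repr(until))
    os.replace(tmp, path)  # atomic rename, so readers never see a partial file
    return until


def seconds_to_wait(path, now=None):
    """How long the calling process should sleep before its next request."""
    now = time.time() if now is None else now
    return max(0.0, read_pause_until(path) - now)
```

In this arrangement, a process that receives a 429 would call `record_pause` with the server's requested delay, and every process would call `seconds_to_wait` before each request.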
[03:20:27] RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:54:25] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:02:56] SRE, Commons, WMF-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (YitNat) Same issue. Uploading 20 MB webm file on 80 KB/s speed upload connection. On Chromium (Brave Browser) it shows: > ERR_HTTP2_PROTOCOL_ERROR On...
[05:57:01] (PS2) ArielGlenn: switch snapshot hosts to use php7.4 [puppet] - https://gerrit.wikimedia.org/r/827954 (https://phabricator.wikimedia.org/T271736)
[05:59:01] (CR) ArielGlenn: [C: +2] switch snapshot hosts to use php7.4 [puppet] - https://gerrit.wikimedia.org/r/827954 (https://phabricator.wikimedia.org/T271736) (owner: ArielGlenn)
[06:17:50] (CR) Urbanecm: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/832715 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[06:49:43] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:04] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220919T0700).
[07:00:04] MdsShakil: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:08] o/
[07:01:10] hi MdsShakil, around?
[07:01:21] urbanecm: yes
[07:01:45] great!
[07:04:00] (CR) Urbanecm: [C: +2] Remove unnecessary wgNamespaceAliases from bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832683 (https://phabricator.wikimedia.org/T318003) (owner: MdsShakil)
[07:04:46] (Merged) jenkins-bot: Remove unnecessary wgNamespaceAliases from bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832683 (https://phabricator.wikimedia.org/T318003) (owner: MdsShakil)
[07:05:37] MdsShakil: your patch is at mwdebug1001, please test
[07:08:59] MdsShakil: how is it going?
[07:09:32] urbanecm: sorry, looking good to me
[07:10:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:11:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:11:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:11:47] MdsShakil: great, syncing
[07:12:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:16:46] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4a6c1ddf5cd1a46ab05f5d6fda4b938a3ee37238: Remove unnecessary wgNamespaceAliases from bnwiki (T318003) (duration: 04m 16s)
[07:16:50] T318003: Remove unnecessary wgNamespaceAliases from bnwiki - https://phabricator.wikimedia.org/T318003
[07:16:52] And, done.
[07:16:59] Took a bit longer than expected, but succeeded.
[07:17:52] urbanecm: Thank you
[07:22:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:26:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:26:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:30:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:32:42] hi, I might add a config patch to this window, if there's time
[07:49:34] (PS1) Kosta Harlan: GrowthExperiments: Enable image recommendations for el/pl/zh/id/ro [mediawiki-config] - https://gerrit.wikimedia.org/r/832959 (https://phabricator.wikimedia.org/T314518)
[07:50:43] (PS2) Kosta Harlan: GrowthExperiments: Enable image recommendations for el/pl/zh/id/ro [mediawiki-config] - https://gerrit.wikimedia.org/r/832959 (https://phabricator.wikimedia.org/T314518)
[07:51:01] cc urbanecm ^
[07:51:15] I can wait until later if we're too close to end of the window
[07:58:56] meh, let's leave it for later
[08:01:45] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:03:15] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:17:55] PROBLEM - Check systemd state on logstash1010 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:02] (CR) Gergő Tisza: [C: +1] GrowthExperiments: Enable image recommendations for el/pl/zh/id/ro [mediawiki-config] - https://gerrit.wikimedia.org/r/832959 (https://phabricator.wikimedia.org/T314518) (owner: Kosta Harlan)
[08:52:56] kostajh: sorry, i wasn't monitoring IRC after the deployment :/
[08:53:12] Looking forward to seeing that project on more wikis though!
[08:53:55] urbanecm: it's ok, ran into some issues with updating MediaWiki:NewcomerTasks.json anyway
[09:12:59] RECOVERY - Check systemd state on logstash1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:23] kostajh: i see. lmk if i can help with those issues somehow
[10:36:59] (PS1) Andrew Bogott: Cloudvirts: remove libguestfs-tools dependency [puppet] - https://gerrit.wikimedia.org/r/832977 (https://phabricator.wikimedia.org/T317344)
[10:46:07] (CR) Andrew Bogott: "This is causing a very noisy type mismatch on all the gitlab-runner nodes, maybe an encoding issue?" [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall)
[10:49:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:54:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:19:35] (PS1) KartikMistry: Update cxserver to 2022-09-15-113346-production [deployment-charts] - https://gerrit.wikimedia.org/r/832989 (https://phabricator.wikimedia.org/T317289)
[11:23:53] (PS1) KartikMistry: testwiki: Enable Section Translation on haw, la, ps and, xh Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/832993 (https://phabricator.wikimedia.org/T317289)
[11:59:45] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/832999 (https://phabricator.wikimedia.org/T316676) (owner: Awight)
[12:13:21] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (CMyrick-WMF) HI Brett, I would like to use to the [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter | JupyterH...
[12:33:48] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable Tech Wishes survey on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832999 (https://phabricator.wikimedia.org/T316676) (owner: Awight)
[12:36:46] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[12:39:43] (PS3) Abijeet Patro: Add editcontentmodel right for translation administrators [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587)
[12:44:00] (CR) Urbanecm: [C: -1] "code lgtm, see inline comment for commit message." [mediawiki-config] - https://gerrit.wikimedia.org/r/830817 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro)
[12:49:25] (CR) Thcipriani: buildkitd: Support configuration of OCI executor nameservers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832584 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall)
[12:56:46] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220919T1300).
[13:00:04] kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] I can deploy today!
[13:00:12] hi kostajh
[13:00:36] \o/
[13:01:16] hello
[13:01:26] thanks urbanecm
[13:01:48] kostajh: afaics, all wikis but plwiki don't have image-recommendation in NewcomerTasks.json. Is that intended/ok?
[13:02:02] urbanecm: eh, no... let me look again
[13:02:06] sure
[13:02:31] urbanecm: they should all have it, am I missing something?
[13:03:19] kostajh: oh, my mistake. i was checking the page history, didn't realize it might get there via Special:EditGrowthConfig.
[13:03:25] urbanecm: see https://phabricator.wikimedia.org/T314518#8245095 for edits made to support this patch. Only plwiki needed the addition of image-recommendation, the others already had it
[13:03:34] yep, missed that. sorry :)
[13:03:35] let's go ahead!
[13:03:39] whew :)
[13:03:39] (CR) Urbanecm: [C: +2] GrowthExperiments: Enable image recommendations for el/pl/zh/id/ro [mediawiki-config] - https://gerrit.wikimedia.org/r/832959 (https://phabricator.wikimedia.org/T314518) (owner: Kosta Harlan)
[13:04:39] (Merged) jenkins-bot: GrowthExperiments: Enable image recommendations for el/pl/zh/id/ro [mediawiki-config] - https://gerrit.wikimedia.org/r/832959 (https://phabricator.wikimedia.org/T314518) (owner: Kosta Harlan)
[13:05:16] kostajh: pulled to mwdebug1001. can you check please?
[13:05:23] urbanecm: yep, looking
[13:08:31] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:09:35] urbanecm: the feature looked good for existing users, but seeing a JS error on task type selection for new accounts. Not sure if it's related, need a few minutes
[13:09:45] sure, waiting
[13:10:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:10:11] (PS1) MVernon: hieradata: remove ms-be20[28-39] from swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/833007 (https://phabricator.wikimedia.org/T294549)
[13:11:43] urbanecm: looks good
[13:11:54] i see a DB error in logstash: `Error connecting to db1189 as user wikiuser202206: :real_connect(): (HY000/2002): Connection refused`
[13:12:20] there is no way a GE config change can cause that, but it is suspicious anyway
[13:12:32] urbanecm: the error I got was attempting to set task filters as a logged-out user. I guess I was somehow logged-out.
[13:12:40] SRE, MediaWiki-extensions-CodeReview, Platform Engineering, serviceops-radar, Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (Jdforrester-WMF) >>! In T205361#8060540, @gerritbot wrote: > Change 774943...
[13:12:48] ack. syncing.
[13:12:49] probably another manifestation of T299193
[13:12:50] T299193: MediaWiki login failure due to race condition with session cookie - https://phabricator.wikimedia.org/T299193
[13:14:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:14:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:17:09] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cbf161d148228e0e706813f923ab1a5d4b42757a: GrowthExperiments: Enable image recommendations for el/pl/zh/id/ro (T314518) (duration: 04m 01s)
[13:17:12] T314518: Scale: deploy "add an image" to el, pl, zh, id, ro - https://phabricator.wikimedia.org/T314518
[13:17:15] kostajh: and should be live
[13:17:16] anything else?
[13:17:23] \o/
[13:17:47] urbanecm: I don't think so. I had a question about whether we should roll out mentor overview Vue to all wikis, but we can leave that for another time, if you want to wait longer
[13:18:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:18:48] I'm leaving that to Sergio. Personally, I didn't see any complaints, and I'm comfortable rolling out, but I think it should be mainly Sergio's call, as he's working on the migration.
[13:20:20] urbanecm: ack
[13:52:53] urbanecm, still around?
[13:52:59] zabe: yes, what's up?
[13:53:29] would you have time to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/832623? the window isn't over yet ;)
[13:53:33] sure
[13:53:55] cool :)
[13:54:02] (CR) Urbanecm: [C: +2] Regenerate ukwikivoyage logo [mediawiki-config] - https://gerrit.wikimedia.org/r/832623 (https://phabricator.wikimedia.org/T317718) (owner: Zabe)
[13:54:05] (PS2) Urbanecm: Regenerate ukwikivoyage logo [mediawiki-config] - https://gerrit.wikimedia.org/r/832623 (https://phabricator.wikimedia.org/T317718) (owner: Zabe)
[13:54:09] (CR) Urbanecm: [C: +2] Regenerate ukwikivoyage logo [mediawiki-config] - https://gerrit.wikimedia.org/r/832623 (https://phabricator.wikimedia.org/T317718) (owner: Zabe)
[13:55:32] (Merged) jenkins-bot: Regenerate ukwikivoyage logo [mediawiki-config] - https://gerrit.wikimedia.org/r/832623 (https://phabricator.wikimedia.org/T317718) (owner: Zabe)
[13:57:11] zabe: pulled to mwdebug1001, can you verify?
[13:57:51] lgtm
[13:57:57] urbanecm, ^
[13:58:04] thanks, deploying
[13:58:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:59:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:59:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:00:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:02:10] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 6c7151d969b6997bd9cce042b7bc78c282dd9b26: Regenerate ukwikivoyage logo (T317718) (duration: 03m 46s)
[14:02:14] T317718: Logo of Ukrainian Wikivoyage differs on different resolutions - https://phabricator.wikimedia.org/T317718
[14:02:15] zabe: and, live
[14:02:21] purging the logo files now
[14:03:12] thanks!
[14:03:33] !log Purge https://en.wikipedia.org/static/images/project-logos/ukwikivoyage{.png,-1.5x.png,-2x.png} (T317718)
[14:03:35] and, done
[14:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:47] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[15:07:39] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (odimitrijevic) Approved!
[15:07:51] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:17:51] SRE, SRE-swift-storage, Data Engineering Planning, Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (bking)
[15:25:47] (PS3) BCornwall: admin: Add cmyrick to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/832716 (https://phabricator.wikimedia.org/T317996)
[15:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220919T1530).
[15:30:51] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (BCornwall)
[15:39:36] (PS2) KartikMistry: testwiki: Enable Section Translation on haw, la, ps and, xh Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/832993 (https://phabricator.wikimedia.org/T317289)
[15:56:20] (CR) Ssingh: "krb: present should be added since Kerberos identity was requested." [puppet] - https://gerrit.wikimedia.org/r/832716 (https://phabricator.wikimedia.org/T317996) (owner: BCornwall)
[15:56:48] (PS4) BCornwall: admin: Add cmyrick to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/832716 (https://phabricator.wikimedia.org/T317996)
[15:58:43] (CR) Ssingh: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/832716 (https://phabricator.wikimedia.org/T317996) (owner: BCornwall)
[15:59:45] (PS1) Ebernhardson: Add token_count subfield to outgoing_link [extensions/CirrusSearch] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/833031 (https://phabricator.wikimedia.org/T317546)
[16:02:53] (CR) BCornwall: [C: +2] admin: Add cmyrick to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/832716 (https://phabricator.wikimedia.org/T317996) (owner: BCornwall)
[16:04:55] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/833020
[16:12:23] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (BCornwall) The request has been merged and you should have received the Kerberos password through email. @CMyrick-WMF Can you...
[16:15:51] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:23:25] (PS1) Dduvall: P:gitlab::runner: $nameservers parameter type should match aliased [puppet] - https://gerrit.wikimedia.org/r/833046 (https://phabricator.wikimedia.org/T317904)
[16:24:25] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:44:25] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for CMyrick-WMF - https://phabricator.wikimedia.org/T317996 (BCornwall) a:BCornwall→CMyrick-WMF
[16:50:48] (CR) Dduvall: "Cherry picked on the standalone puppetmaster and seems to fix the noisy type mismatch errors." [puppet] - https://gerrit.wikimedia.org/r/833046 (https://phabricator.wikimedia.org/T317904) (owner: Dduvall)
[16:51:09] (CR) Ebernhardson: [V: +1 C: -1] "this might not be needed, we are alternatively considering munging the dumps in yarn and uploading the results to swift, this would cut 20" [puppet] - https://gerrit.wikimedia.org/r/832543 (https://phabricator.wikimedia.org/T222349) (owner: Ebernhardson)
[16:56:19] (PS3) Jforrester: ExtensionDistributor: Add REL1_39 [mediawiki-config] - https://gerrit.wikimedia.org/r/829877 (https://phabricator.wikimedia.org/T313925)
[16:57:59] jouncebot: next
[16:57:59] In 0 hour(s) and 2 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220919T1700)
[16:58:09] Oh well.
[17:00:05] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220919T1700).
[17:01:57] (PS1) Zabe: build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - https://gerrit.wikimedia.org/r/833057
[17:20:37] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:04] !log dancy@deploy1002 Started scap: testing, disregard
[17:36:26] !log dancy@deploy1002 dancy: testing, disregard synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[17:36:30] !log dancy@deploy1002 Sync cancelled.
[17:40:11] SRE, SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (Devnull)
[17:42:52] !log dancy@deploy1002 Installing scap version "4.21.0" for 561 hosts
[17:43:11] !log dancy@deploy1002 Installation of scap version "4.21.0" completed for 561 hosts
[17:45:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:49:31] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[17:50:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:51:13] SRE, MediaWiki-extensions-CodeReview, Platform Engineering, serviceops-radar, Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (Krinkle) >>! In T205361#7945628, @Legoktm wrote: >>>! In T205361#7815573, @...
[17:55:56] (PS1) Dduvall: buildkitd: Install wmf-certificates for registry CA [docker-images/production-images] - https://gerrit.wikimedia.org/r/833067 (https://phabricator.wikimedia.org/T318019)
[17:58:14] (CR) Dduvall: buildkitd: Bump version to 0.10.4 (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/830909 (owner: Dduvall)
[18:03:05] PROBLEM - Check systemd state on ms-be1057 is CRITICAL: CRITICAL - degraded: The following units failed: rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:12:41] RECOVERY - Check systemd state on ms-be1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:49:34] SRE, Traffic, Performance-Team (Radar): Review socket balancing in ATS/Varnish traffic layers - https://phabricator.wikimedia.org/T248522 (Krinkle)
[18:53:38] (PS8) BCornwall: Unlink certificate renewal and OCSP handling [software/acme-chief] - https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232)
[19:00:09] (PS1) Ebernhardson: sre.wdqs.data-reload: Simplify passing a timestamp for kafka [cookbooks] - https://gerrit.wikimedia.org/r/833082
[19:01:10] (PS9) BCornwall: Unlink certificate renewal and OCSP handling [software/acme-chief] - https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232)
[19:01:23] (CR) BCornwall: Unlink certificate renewal and OCSP handling (3 comments) [software/acme-chief] - https://gerrit.wikimedia.org/r/820795 (https://phabricator.wikimedia.org/T244232) (owner: BCornwall)
[19:04:55] (CR) CI reject: [V: -1] sre.wdqs.data-reload: Simplify passing a timestamp for kafka [cookbooks] - https://gerrit.wikimedia.org/r/833082 (owner: Ebernhardson)
[19:10:03] (CR) Andrew Bogott: [C: +2] Cloudvirts: remove libguestfs-tools dependency [puppet] - https://gerrit.wikimedia.org/r/832977 (https://phabricator.wikimedia.org/T317344) (owner: Andrew Bogott)
[19:11:51] (PS2) Ebernhardson: sre.wdqs.data-reload: Simplify passing a timestamp for kafka [cookbooks] - https://gerrit.wikimedia.org/r/833082
[19:14:17] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:14:45] SRE, DNS, Domains, WMF-Legal: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508 (BCornwall)
[19:14:53] SRE, DNS, Domains, WMF-Legal: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508 (BCornwall) Open→Invalid As the server appears to be dead and nearly all of those domain names being removed, I think this can safely be closed. If there'...
[19:19:25] (CR) Bking: [C: +2] sre.wdqs.data-reload: Simplify passing a timestamp for kafka [cookbooks] - https://gerrit.wikimedia.org/r/833082 (owner: Ebernhardson)
[19:22:39] (CR) Bking: [V: +2 C: +1] sre.wdqs.data-reload: Simplify passing a timestamp for kafka [cookbooks] - https://gerrit.wikimedia.org/r/833082 (owner: Ebernhardson)
[19:22:41] (CR) Bking: [V: +2 C: +2] sre.wdqs.data-reload: Simplify passing a timestamp for kafka [cookbooks] - https://gerrit.wikimedia.org/r/833082 (owner: Ebernhardson)
[19:28:49] PROBLEM - SSH on analytics1077.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:30:30] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[19:30:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[19:31:04] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[19:33:11] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[19:33:17] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[19:33:29] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[19:35:18] SRE, PyBal, Traffic-Icebox: Backport ipvsadm - https://phabricator.wikimedia.org/T171850 (BCornwall) Open→Invalid ` $ sudo cumin '*lvs*' 'grep VERSION= /etc/os-release' [...] ----- OUTPUT of 'grep VERSION= /etc/os-release' -----...
[19:35:20] SRE, PyBal, Traffic-Icebox: PyBal Feature: progressive depooling strategy for monitored failures - https://phabricator.wikimedia.org/T172124 (BCornwall)
[19:35:24] SRE, PyBal, Traffic-Icebox: IPVS issues with UDP services, pybal depooling strategy - https://phabricator.wikimedia.org/T172103 (BCornwall)
[20:00:05] RoanKattouw, Urbanecm, and cjming: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220919T2000).
[20:00:05] arlolra, ebernhardson, James_F, and zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] Adsum.
[20:01:27] Anyone around to deploy? I can do it if needed.
[20:01:53] hey o/
[20:01:58] i can deploy - just getting set up
[20:02:06] Sure, thanks cjming.
[20:02:35] here
[20:02:48] (CR) Jforrester: [C: +1] build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - https://gerrit.wikimedia.org/r/833057 (owner: Zabe)
[20:03:02] hi Arlolra - starting with yours
[20:03:07] thanks
[20:03:16] (PS2) Clare Ming: Disable wgParserEnableLegacyMediaDOM on cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832715 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[20:04:49] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/832715 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[20:05:17] ebernhardson: are you around for your cirrus patch?
[20:05:57] (Merged) jenkins-bot: Disable wgParserEnableLegacyMediaDOM on cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832715 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[20:06:15] !log cjming@deploy1002 Started scap: Backport for [[gerrit:832715|Disable wgParserEnableLegacyMediaDOM on cswiki (T314318)]]
[20:06:19] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318
[20:06:36] !log cjming@deploy1002 cjming and arlolra: Backport for [[gerrit:832715|Disable wgParserEnableLegacyMediaDOM on cswiki (T314318)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:06:37] Arlolra: can you verify on one of the test servers?
[20:06:53] yup, one sec
[20:08:07] Ok, looks good
[20:08:31] great - going live
[20:09:33] James_F: would you like to self-deploy your patches? (i'll do Erik's later if/when he shows up)
[20:09:57] cjming: Sure!
[20:10:04] And I can take zabe's whilst I'm at it?
[20:10:15] be my guest
[20:10:20] (CR) Jforrester: [C: +2] ExtensionDistributor: Add REL1_39 [mediawiki-config] - https://gerrit.wikimedia.org/r/829877 (https://phabricator.wikimedia.org/T313925) (owner: Jforrester)
[20:10:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply