ssingh (Sukhbir Singh)
SRE/Traffic

User Details

User Since
Dec 11 2018, 9:39 PM (358 w, 1 d)
Availability
Available
IRC Nick
sukhe
LDAP User
Unknown
MediaWiki User
SSingh (WMF) [ Global Accounts ]

Oh hi. Nice to see you here.

Recent Activity

Yesterday

ssingh closed T407966: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep, a subtask of T404826: Integrate code from the private repository into the CDN, as Resolved.
Wed, Oct 22, 7:10 PM · Hiddenparma, Traffic, SRE
ssingh closed T407966: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep as Resolved.

Sorry about this, this should now be fixed. And glad to see that Traffic was added automatically, thanks to @bd808 and @Ladsgroup for their work on this!

Wed, Oct 22, 7:10 PM · Traffic, Beta-Cluster-Infrastructure
ssingh assigned T408003: [Update DNS Record Request] - wikimedia.org to BCornwall.
Wed, Oct 22, 5:37 PM · Patch-For-Review, Traffic
ssingh added a comment to T404826: Integrate code from the private repository into the CDN.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197986 has caused puppet to break on deployment-cache-upload08.deployment-prep. Please help!

Wed, Oct 22, 3:08 PM · Hiddenparma, Traffic, SRE
ssingh added a comment to T407966: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep.

This is because of:

Wed, Oct 22, 2:55 PM · Traffic, Beta-Cluster-Infrastructure
ssingh added a comment to T390813: Upgrade End Of Support Junos.

@ssingh @Vgutierrez hello just checking in to see if you have a day and time for this for drmrs.
Thanks

Wed, Oct 22, 2:45 PM · Traffic, netops, Infrastructure-Foundations

Tue, Oct 21

ssingh added a comment to T404913: Transfer wikipedia.pt domain to community.

Hi @CRoslof: This is another ticket that we would like to take up and will need your help with so that we can reflect it in downstream services as well. Let me know if I should create a Zendesk thread for tracking by other Legal members. Thanks a lot for bearing with us and helping us clean up the ownership.

Tue, Oct 21, 7:18 PM · Traffic, Domains
ssingh added a project to T407787: Alertmanager triggers an alert on IRC and email after the alert has resolved: Spicerack.

It looks like spicerack should check that alerts for the downtimed host have been resolved (not in firing state) before deleting the silence/downtime with ALERTS{alertstate="firing", instance=~"cp5018:.*"}

Tue, Oct 21, 1:34 PM · Infrastructure-Foundations, Spicerack, SRE-tools, Traffic, Observability-Alerting
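
The check suggested in the T407787 comment above could be sketched roughly as follows. This is a minimal illustration, not Spicerack's actual code: the function names are hypothetical, and it only assumes the standard Prometheus instant-query endpoint (`/api/v1/query`) and its JSON response shape. The selector is the one quoted in the comment.

```python
import json
from urllib.parse import urlencode


def firing_alerts_query(host: str) -> str:
    """Build the PromQL selector quoted in the comment for a given host."""
    return f'ALERTS{{alertstate="firing", instance=~"{host}:.*"}}'


def query_url(base_url: str, host: str) -> str:
    """URL for a Prometheus instant query; base_url is a placeholder."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": firing_alerts_query(host)})


def has_firing_alerts(response_body: str) -> bool:
    """Return True if the instant-query response contains any series."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    return len(payload["data"]["result"]) > 0


def safe_to_remove_silence(response_body: str) -> bool:
    """Hypothetical gate: only remove the silence when nothing is firing."""
    return not has_firing_alerts(response_body)
```

A caller would fetch `query_url(...)`, pass the body to `safe_to_remove_silence`, and keep the downtime in place (or retry later) while it returns `False`.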

Mon, Oct 20

ssingh triaged T407787: Alertmanager triggers an alert on IRC and email after the alert has resolved as Low priority.
Mon, Oct 20, 7:01 PM · Infrastructure-Foundations, Spicerack, SRE-tools, Traffic, Observability-Alerting
ssingh created T407787: Alertmanager triggers an alert on IRC and email after the alert has resolved.
Mon, Oct 20, 7:01 PM · Infrastructure-Foundations, Spicerack, SRE-tools, Traffic, Observability-Alerting
ssingh updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mon, Oct 20, 5:18 PM · Traffic
ssingh edited projects for T407769: Improve how we build the 'haproxy_allowed_healthcheck_sources' list of IPs, added: Traffic; removed Traffic-Icebox.

Thanks for filing this task! I think this is a good idea to reduce the manual updates to this list, and something we have failed to keep updated. We will triage this after discussion in Traffic.

Mon, Oct 20, 3:31 PM · Traffic, SRE

Fri, Oct 17

ssingh added a comment to T332220: Acquire enwp.org.

Nice job indeed in pursuing this over the years, Brett!

Fri, Oct 17, 7:25 PM · Traffic, SRE, Domains
ssingh updated subscribers of T406880: hCaptcha: Implement alerts.

[Adding Raine @kamila as well.]

Fri, Oct 17, 7:23 PM · Patch-For-Review, Product Safety and Integrity (Sprint Mint Choc Chip Ice Cream (Oct 20 - Nov 7)), Observability-Alerting, WE4.2 Bot detection (WE4.2 hCaptcha account creation trial)
ssingh moved T407570: Test the impact of incremental increase in traffic for cache splitting experiments from Backlog to Actively Servicing on the Traffic board.
Fri, Oct 17, 1:20 PM · Traffic, Experimentation Lab
ssingh added a comment to T407570: Test the impact of incremental increase in traffic for cache splitting experiments.

Thanks for filing the task, @JVanderhoop-WMF. As per the discussion on Slack, the above sounds good.

Fri, Oct 17, 1:18 PM · Traffic, Experimentation Lab

Thu, Oct 16

ssingh closed T407421: cp7007 hardware issues after reboot as Resolved.

Thanks to @Jhancock.wm for the help with this!

Thu, Oct 16, 3:22 PM · DC-Ops, Traffic, ops-magru
ssingh added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

FWIW doing one or two hosts is more than enough. We will reimage them again anyway so it doesn't make sense IMO for you both to spend time upgrading all of them to trixie. If one or two reimage fine, please leave the rest to us.

Thu, Oct 16, 3:20 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops

Wed, Oct 15

ssingh assigned T407421: cp7007 hardware issues after reboot to BCornwall.
Wed, Oct 15, 7:13 PM · DC-Ops, Traffic, ops-magru
ssingh created T407421: cp7007 hardware issues after reboot.
Wed, Oct 15, 7:12 PM · DC-Ops, Traffic, ops-magru
ssingh added a comment to T407320: Package benthos/redpanda for trixie.

Do you happen to have a trixie host available that we can try the existing package on?

Wed, Oct 15, 2:24 PM · Observability-Logging, Traffic
ssingh added a comment to T407156: Request to create the 25.wikipedia.org domain + 301 redirect to the org site.

I was also looped into a new request today. As part of the birthday initiative, the Fundraising team is developing a customized donation portal under the donate.wiki domain. Would it be possible to set up a redirect for this new portal as well? I don’t have the final destination URL yet, but we’d like to create the domain donate.wikipedia25.org to redirect to the donation portal once it’s ready. Is this something you could help with too?

Wed, Oct 15, 1:22 PM · Traffic, DNS, Domains
ssingh added a comment to T405499: Remove lvs1018 L2 link to ssw1-e1-eqiad.

FWIW we have typically reimaged for this in the past. I am not suggesting, just sharing! And given that this is lvs1020, that might be OK? (Leaving to you both for the final decision.)

This is lvs1018. I'm comfortable enough either way, can I leave the decision with traffic?

I was thinking of trying to schedule this for tomorrow, Thurs Oct 16th if that worked for you guys?

Wed, Oct 15, 1:10 PM · DC-Ops, ops-eqiad, Infrastructure-Foundations, netops, SRE

Tue, Oct 14

ssingh updated the task description for T401832: Upgrade Traffic hosts to trixie.
Tue, Oct 14, 7:02 PM · Traffic
ssingh updated the task description for T401832: Upgrade Traffic hosts to trixie.
Tue, Oct 14, 6:59 PM · Traffic
ssingh updated the task description for T401832: Upgrade Traffic hosts to trixie.
Tue, Oct 14, 6:26 PM · Traffic
ssingh updated the task description for T401832: Upgrade Traffic hosts to trixie.
Tue, Oct 14, 6:25 PM · Traffic
ssingh added a comment to T406650: Copy the Traffic team on alerts for deployment-cache* hosts.

Thanks for working on this! We will try our best to follow up on our end in making sure that Puppet is not broken on the cache hosts in Beta.

Tue, Oct 14, 5:49 PM · User-bd808, Traffic, Beta-Cluster-Infrastructure
ssingh added a comment to T405499: Remove lvs1018 L2 link to ssw1-e1-eqiad.

FWIW we have typically reimaged for this in the past. I am not suggesting, just sharing! And given that this is lvs1020, that might be OK? (Leaving to you both for the final decision.)

Tue, Oct 14, 5:40 PM · DC-Ops, ops-eqiad, Infrastructure-Foundations, netops, SRE
ssingh closed T405102: Create boot environment of Bullseye with a 6.1 kernel , a subtask of T392851: Q4:rack/setup/install cp20[43-58] codfw, as Resolved.
Tue, Oct 14, 5:38 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops
ssingh closed T405102: Create boot environment of Bullseye with a 6.1 kernel as Resolved.

Thanks @MoritzMuehlenhoff for working on this and researching it. I am closing this for the reason mentioned above.

Tue, Oct 14, 5:38 PM · User-Elukey, SRE, Traffic, ops-codfw, DC-Ops
ssingh added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

Cross-posting the comment from T405102#11273708,

Tue, Oct 14, 5:32 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops
ssingh added a comment to T405102: Create boot environment of Bullseye with a 6.1 kernel .

Traffic discussed this in the team meeting today. We decided that given the above blocker, we should simply move to trixie and use OpenSSL (3.5.0) as shipped by trixie. The reasoning is this: it doesn't make sense to upgrade to bookworm as that ships OpenSSL 3.0 and we have concerns around its performance (we haven't evaluated 3.0 as we have 3.5, but we believe that to be the case from what we have seen).

Tue, Oct 14, 5:30 PM · User-Elukey, SRE, Traffic, ops-codfw, DC-Ops
ssingh added a comment to T401331: Request for a new request dataset for caching research.

Traffic has a dependency on Data Engineering for this to be executed, as we will need their help to run the query and generate the data set. That being said, we will make sure that we take this on when Data Engineering is ready (including doing the leg work).

Tue, Oct 14, 5:23 PM · Traffic, Data-Engineering
ssingh added a comment to T407194: Consider using EdDSA rather than RSA for MediaWiki session tokens.

(Is there anything -- including input -- required from Traffic on this? I am asking since we were added and can triage the task accordingly.)

Tue, Oct 14, 4:17 PM · MediaWiki-Platform-Team, Traffic, MediaWiki-Core-AuthManager
ssingh assigned T407156: Request to create the 25.wikipedia.org domain + 301 redirect to the org site to BCornwall.
Tue, Oct 14, 4:08 PM · Traffic, DNS, Domains

Thu, Oct 9

ssingh closed T406166: eqiad: 2 VM request for hCaptcha as Resolved.

Rolled out.

Thu, Oct 9, 5:48 PM · Traffic, vm-requests, Infrastructure-Foundations, SRE

Wed, Oct 8

ssingh closed T406141: Disable LVS paging for WDQS as Resolved.
Wed, Oct 8, 6:54 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), Traffic
ssingh added a comment to T406141: Disable LVS paging for WDQS.

Thanks to a review by @bking, we have merged this. I think we can consider this as resolved. Thanks @LSobanski and @Gehel.

Wed, Oct 8, 6:54 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), Traffic
ssingh added a comment to T389333: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ?.

Thanks @MoritzMuehlenhoff, that sounds like a good plan to me but leaving to @CDobbins for the final word.

Wed, Oct 8, 3:18 PM · Patch-For-Review, SRE, DNS, Traffic
ssingh added a project to T384425: Port DNS icinga checks to Alertmanager: Traffic.
Wed, Oct 8, 1:09 PM · Traffic, Observability-Alerting

Tue, Oct 7

ssingh added a comment to T406650: Copy the Traffic team on alerts for deployment-cache* hosts.

@bd808 and I discussed this today and decided to split this up in two parts:

Tue, Oct 7, 9:16 PM · User-bd808, Traffic, Beta-Cluster-Infrastructure
ssingh closed T406167: codfw: 2 VM request for hCaptcha as Resolved.

hcaptcha200[1-2].wikimedia.org are ready.

Tue, Oct 7, 1:25 PM · Traffic, vm-requests, Infrastructure-Foundations, SRE
ssingh added a comment to T406166: eqiad: 2 VM request for hCaptcha.

@ssingh FYI I ran the sre.dns.netbox cookbook just now as it alerted on being a diff, it removed the entries for hcaptcha1001.

The VM doesn't exist in Netbox right now and the IPs assigned for it were unreachable so I figured it was safe.

Tue, Oct 7, 1:09 PM · Traffic, vm-requests, Infrastructure-Foundations, SRE
ssingh added a comment to T406141: Disable LVS paging for WDQS.

Now that the other endpoints were added, is there anything else that needs to happen before the patch is deployed?

Tue, Oct 7, 1:07 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), Traffic

Mon, Oct 6

ssingh added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

Hi @akosiaris: Following up on this after a discussion during Traffic's planning with @Vgutierrez, and on behalf of the team.

Mon, Oct 6, 4:55 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic
ssingh closed T334884: Updating Netbox for LVS hosts in eqiad lvs10(1[789]|20) as Resolved.

Thanks for following up @cmooney. Other than the fact that we owed you a response on your last comment and never did -- that's on me of course -- yes, I think we can close it.

Mon, Oct 6, 2:50 PM · Infrastructure-Foundations, Traffic, SRE

Fri, Oct 3

ssingh added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

Status update: I was able to upgrade idrac+bios of most of the cp hosts, I'll review the remaining ones on Monday and I'll give a precise list in here if there will be outstanding problems or not.

The code to make the firmware upgrade cookbook to work is still under review, but I tested it and it works :)

Fri, Oct 3, 4:21 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops
ssingh changed the status of T398596: Consider using the alternate chain of Google Trust Services certificates from Open to Stalled.

We will be resuming work on this in Q3 2025-2026.

Fri, Oct 3, 2:44 PM · MW-1.45-notes (1.45.0-wmf.12; 2025-07-29), Traffic
ssingh added a comment to T343000: HAProxy metrics go down on config reload.

Yes, it's still happening: https://grafana.wikimedia.org/goto/SHdP6s3HR?orgId=1

I believe this will be fixed when we upgrade to HAProxy 3.0 given it provides persistent stats.

Fri, Oct 3, 2:29 PM · SRE, observability, Traffic
ssingh added a comment to T398588: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension).

@Shisma: Hi, this still needs a Wikimedia affiliate approval as per https://lists.wikimedia.org/pipermail/maps-l/2020-August/001729.html. Can you provide one? After that, we will need to check the technical feasibility.

Fri, Oct 3, 2:20 PM · Traffic, Maps, SRE
ssingh added a comment to T343000: HAProxy metrics go down on config reload.

Can someone help me double-check if this is still a problem? I don't see it in the dashboards above, selecting a more recent time interval.

Fri, Oct 3, 2:19 PM · SRE, observability, Traffic

Thu, Oct 2

ssingh added a comment to T406141: Disable LVS paging for WDQS.

I'm not entirely sure which alerts paged during the last WDQS outage. I think it was about the number of servers depooled from LVS and not from error rate. The currently linked gerrit patch removes pages in cases of high number of backend errors, but I don't think it removes the page in case of too many servers depooled.

Thu, Oct 2, 1:30 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), Traffic
ssingh updated subscribers of T404959: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad.

@BCornwall from Traffic will be working on this, thanks!

Thu, Oct 2, 1:27 PM · DC-Ops, ops-eqiad, Traffic, Infrastructure-Foundations, netops, SRE

Wed, Oct 1

ssingh added a comment to T400952: Setting up Wikimedia Trust and Safety Help Center with Zendesk product: Seeking Guidance on host mapping .

@BCornwall from Traffic is handling this and will sync with @JAbrams and @jrbs for when the change is made live, to ensure incoming and outgoing email works as expected.

Wed, Oct 1, 7:03 PM · DNS, Traffic
ssingh added a project to T406167: codfw: 2 VM request for hCaptcha: Traffic.
Wed, Oct 1, 5:48 PM · Traffic, vm-requests, Infrastructure-Foundations, SRE
ssingh added a parent task for T406166: eqiad: 2 VM request for hCaptcha: Unknown Object (Task).
Wed, Oct 1, 5:48 PM · Traffic, vm-requests, Infrastructure-Foundations, SRE
ssingh added a parent task for T406167: codfw: 2 VM request for hCaptcha: Unknown Object (Task).
Wed, Oct 1, 5:47 PM · Traffic, vm-requests, Infrastructure-Foundations, SRE
ssingh created T406167: codfw: 2 VM request for hCaptcha.
Wed, Oct 1, 5:46 PM · Traffic, vm-requests, Infrastructure-Foundations, SRE
ssingh added a project to T406166: eqiad: 2 VM request for hCaptcha: Traffic.
Wed, Oct 1, 5:45 PM · Traffic, vm-requests, Infrastructure-Foundations, SRE
ssingh created T406166: eqiad: 2 VM request for hCaptcha.
Wed, Oct 1, 5:45 PM · Traffic, vm-requests, Infrastructure-Foundations, SRE
ssingh claimed T406141: Disable LVS paging for WDQS.
Wed, Oct 1, 4:19 PM · Essential-Work, Data-Platform-SRE (2025.09.26 - 2025.10.17), Traffic

Tue, Sep 30

ssingh added a comment to T394789: Validate pybal config in CI.

Thanks @ssingh , that's completely understandable. If PyBal is going away, it's probably not worth the effort to fix it.

But I'm wondering if Liberica has the same issue? In other words, is it possible to create an invalid Liberica config via Puppet patches, that would be caught by the ConfdResourceFailed alerts? If so, then the task is still valid, it just needs to change scope. Let me know what you think.

Tue, Sep 30, 2:22 PM · Traffic, Data-Platform-SRE

Mon, Sep 29

ssingh added a comment to T404219: 403 errors with user-agent.

I admittedly don't know enough of what OpenRefine is doing under the hood to get the query path, but I have the JSON for the reconciliation call if that helps. I can also share the Python script that's erroring.

Mon, Sep 29, 8:12 PM · Traffic
ssingh added a comment to T404219: 403 errors with user-agent.

Hey there, sorry about the delay, I was away from the office for a bit. The folks at T402959 don't think this issue is due to the SPARQL queries, but I just checked and I'm still having the same issues as I was a couple weeks ago.

Mon, Sep 29, 7:16 PM · Traffic
ssingh added a comment to T405165: Fetching mediawiki GPG keys fail with error "No data" due to User-Agent requirement.
In T405165#11225510, @ssingh wrote:

I don't think this has anything to do with the UA. Is the command correct? Maybe try downloading via curl/wget and then piping to gpg --import. For example, curl https://www.mediawiki.org/keys/keys.txt | gpg --import - works for me.

I used tcpdump to inspect the traffic (using http instead of https URL) on both Debian and OpenSuSE, and no User-Agent was sent on Debian (see description).
I later confirmed the user-agent issue using curl to request the same URL, with the default user agent and also explicitly removing the user agent, with consistent results (failure when no user-agent was present)

How easy/difficult is it to whitelist a specific path or URL? It was working before your (WMF) changes, requesting a particular static txt file shouldn't cause any significant amount of pressure on the servers, and retrieving it from gpg directly is very common. Debugging such a cryptic error message can be time consuming.

Mon, Sep 29, 6:20 PM · Patch-For-Review, Traffic
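
As a workaround along the lines discussed in T405165 above, the key file can be fetched with an explicit User-Agent and then handed to gpg. The sketch below is only an illustration: the `USER_AGENT` value and function names are made up, and it assumes `gpg` is on the PATH and that the request succeeds once a User-Agent header is present.

```python
import subprocess
import urllib.request

KEYS_URL = "https://www.mediawiki.org/keys/keys.txt"
# Hypothetical identifier; substitute a User-Agent describing your own tool.
USER_AGENT = "example-key-fetcher/0.1 (admin@example.org)"


def build_request(url: str = KEYS_URL, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Create a request that always carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def import_keys(url: str = KEYS_URL) -> None:
    """Download the key file and pipe it to `gpg --import -` on stdin."""
    with urllib.request.urlopen(build_request(url), timeout=30) as resp:
        key_data = resp.read()
    # Mirrors the quoted shell pipeline: curl ... | gpg --import -
    subprocess.run(["gpg", "--import", "-"], input=key_data, check=True)
```

Calling `import_keys()` performs the download and the import; `build_request()` on its own can be used to inspect the headers without touching the network.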
ssingh updated subscribers of T405942: eqiad row C/D Data Persistence host migrations.

Note that @KOfori is out, this should be directed to @Kappakayala in the meantime.

Mon, Sep 29, 4:41 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad
ssingh added a comment to T405165: Fetching mediawiki GPG keys fail with error "No data" due to User-Agent requirement.

I don't think this has anything to do with the UA. Is the command correct? Maybe try downloading via curl/wget and then piping to gpg --import. For example, curl https://www.mediawiki.org/keys/keys.txt | gpg --import - works for me.

Mon, Sep 29, 4:20 PM · Patch-For-Review, Traffic
ssingh removed a project from T399674: Requests fail with Access-Control-Allow-Origin errors when using ForeignApi on iOS Safari: Traffic.

[Removing Traffic since this is a MW issue. Please re-add if you disagree.]

Mon, Sep 29, 3:46 PM · MobileFrontend (Tracking), MW-Interfaces-Team, MediaWiki-Action-API, JavaScript
ssingh added a comment to T310009: Make it easier to create a new requestctl object.

I am curious: should we keep this open or should this be resolved now given that we have requestctl.wikimedia.org and that takes care of most of the pain points?

Mon, Sep 29, 3:44 PM · Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), conftool
ssingh closed T287847: Performance implications of buffer sizes in Apache Traffic Server intercept plugins as Resolved.

This was merged upstream in 9.2.x so we have inherited this change. Since we have not revisited this since 2021 but can pursue if required, I am taking the liberty to mark this as resolved.

Mon, Sep 29, 3:38 PM · Traffic, SRE
ssingh closed T274431: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467, a subtask of T252132: Deploy Wikimedia DNS: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver, as Resolved.
Mon, Sep 29, 3:36 PM · User-notice, Epic, SRE, Traffic
ssingh closed T274431: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 as Resolved.

We have had this for a while and the responses are padded. Marking as resolved.

Mon, Sep 29, 3:36 PM · SRE, Traffic
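
For context on the padding mentioned in T274431 above: RFC 8467 recommends block-length padding, with queries padded to a multiple of 128 octets and responses to a multiple of 468 octets. The sketch below only illustrates that arithmetic; it is not Wikidough's configuration or resolver code, the function names are hypothetical, and it assumes the 4-octet EDNS option header (code plus length) counts toward the padded size.

```python
QUERY_BLOCK = 128      # RFC 8467 recommended block size for queries
RESPONSE_BLOCK = 468   # RFC 8467 recommended block size for responses
OPTION_HEADER = 4      # EDNS option code (2 octets) + option length (2 octets)


def padding_data_length(unpadded_length: int, block: int) -> int:
    """Octets of padding data needed so the final message is a multiple of block.

    unpadded_length is the message size before the Padding option is added.
    """
    if unpadded_length < 0 or block <= 0:
        raise ValueError("lengths must be non-negative and block positive")
    return (-(unpadded_length + OPTION_HEADER)) % block


def padded_message_length(unpadded_length: int, block: int) -> int:
    """Total message size after adding the Padding option."""
    return unpadded_length + OPTION_HEADER + padding_data_length(unpadded_length, block)
```

For example, a 100-octet response would carry 364 octets of padding data and end up at 468 octets in total.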
ssingh added a comment to T368694: ncmonitor should not submit new CRs if there are still some yet to be reviewed.

The goal is to have ncmonitor run much more frequently than it currently is (e.g. once a day rather than once a week) so that we can accelerate the turnaround time for when new domains are registered. If ncmonitor were to run once a day right now then we'd end up with daily CRs of the same patchset. By bailing out we give the team some grace to review rather than have to deal with a small pile of the same patch.

Mon, Sep 29, 3:32 PM · Patch-For-Review, Traffic
ssingh added a comment to T315911: ATS Read While Writer feature is wrongly configured.

What is the update on this? It has been a while, and I am a bit confused reading the text and trying to follow the CRs.

Mon, Sep 29, 3:29 PM · Wikimedia-Performance-recommendation, Patch-For-Review, SRE, Traffic
ssingh added a comment to T294800: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts.

And to be clear, by that I mean that this change is better suited for MW and not the CDN.

Mon, Sep 29, 3:04 PM · Traffic, serviceops, SRE
ssingh moved T294800: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts from Backlog to Radar/Not for Service on the Traffic board.

This is being moved on the Traffic workboard to "Radar/Not for service" as I don't think there is anything on our end to do here. Please let me know if you disagree and adjust accordingly.

Mon, Sep 29, 3:04 PM · Traffic, serviceops, SRE
ssingh assigned T399688: varnish wikimedia_trust ACL isn't used anymore to BCornwall.
Mon, Sep 29, 2:58 PM · Patch-For-Review, Traffic
ssingh added a comment to T368694: ncmonitor should not submit new CRs if there are still some yet to be reviewed.

My two cents: as you mention above, is this really required? I think this may be more work without giving us a tangible benefit. Put differently, what is the concern with multiple CRs?

Mon, Sep 29, 2:56 PM · Patch-For-Review, Traffic
ssingh moved T404219: 403 errors with user-agent from Actively Servicing to Backlog on the Traffic board.

Awaiting a response from @Lupascriptix before we can triage it again.

Mon, Sep 29, 2:54 PM · Traffic
ssingh closed T399947: ncredir sometimes receives large traffic spikes leading to unavailability as Resolved.

Paging has been disabled on this alert since July 2025. But even other than that, looking at the 90-day view, this has not happened in a while.

Mon, Sep 29, 2:42 PM · Traffic
ssingh closed T367056: Rise in ms-fe2* TCP retransmits since 11:40 UTC today as Resolved.

OK thank you. I am marking this as resolved for now. We can re-open as required.

Mon, Sep 29, 2:40 PM · Traffic, SRE, SRE-swift-storage
ssingh moved T117618: Add restrictive CSP to upload.wikimedia.org from Backlog to Radar/Not for Service on the Traffic board.
Mon, Sep 29, 2:38 PM · Patch-For-Review, Traffic, ContentSecurityPolicy, WMF-General-or-Unknown, Security-Team
ssingh added a comment to T117618: Add restrictive CSP to upload.wikimedia.org.

Commenting from Traffic's side: this is, in some ways, a trivial patch for us because we are simply setting an additional header. The challenge here, though, is understanding the header itself, the associated ramifications of setting it, and keeping it updated. For that, the Security team needs to be consulted, so this patch is currently blocked on that happening.

Mon, Sep 29, 2:37 PM · Patch-For-Review, Traffic, ContentSecurityPolicy, WMF-General-or-Unknown, Security-Team
ssingh added a comment to T400324: Consider using Intel Xeon CPUs with QAT.

The new cp hosts in codfw, Intel(R) Xeon(R) 6730P, support QAT, so we can experiment there and see if that helps.

Mon, Sep 29, 2:36 PM · Traffic
ssingh triaged T401025: Investigate setting init_on_alloc=0 on cache hosts as Low priority.

This is worth investigating/researching but I am marking this as "Low" and we can revisit it for Q3 2025-2026.

Mon, Sep 29, 2:16 PM · Traffic
ssingh closed T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps, a subtask of T346722: Sao Paulo, Brazil, South America POP tracking task, as Resolved.
Mon, Sep 29, 2:11 PM · ops-magru
ssingh closed T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps as Resolved.

We have made progress in T301605, and specific to this task, we ramped up traffic to magru already.

Mon, Sep 29, 2:11 PM · Infrastructure-Foundations, SRE, Traffic
ssingh added a comment to T367056: Rise in ms-fe2* TCP retransmits since 11:40 UTC today .

There has been no follow-up on this after Jun 2024. @MatthewVernon: should we keep this open?

Mon, Sep 29, 2:09 PM · Traffic, SRE, SRE-swift-storage
ssingh added a comment to T342154: Upgrade Traffic hosts to bookworm.

Because of the regressions observed in OpenSSL 3.x, we never finished the upgrade for the cp hosts to bookworm. This task is stalled while we go through the possible upgrade paths for either OpenSSL, or an upgrade to trixie itself, depending on the former.

Mon, Sep 29, 2:07 PM · Patch-For-Review, Traffic
ssingh closed T382790: "Backend fetch failed" on edit save as Resolved.

This has been open for a while and there hasn't been any follow up from either side. @MGChecker: Please re-open if this issue still persists for you. Thanks!

Mon, Sep 29, 2:02 PM · Traffic, SRE
ssingh added a comment to T389707: purged event lag keeps piling up in codfw topics after switchover.

@Fabfur: Is there any follow-up to do here or can we close this given the issue at hand was resolved? Note that in the recent switchover, we didn't observe this issue.

Mon, Sep 29, 2:01 PM · Traffic

Thu, Sep 25

ssingh moved T405623: eqiad row C/D Traffic host migrations from Backlog to Actively Servicing on the Traffic board.
Thu, Sep 25, 5:07 PM · Traffic, SRE, DC-Ops, ops-eqiad

Wed, Sep 24

ssingh updated ssingh.
Wed, Sep 24, 11:54 PM
ssingh closed T404974: [Search Console Verification DNS Request] - {{wikimediafoundation.org}} as Resolved.
Wed, Sep 24, 3:59 PM · Traffic, SRE
ssingh added a comment to T404974: [Search Console Verification DNS Request] - {{wikimediafoundation.org}}.

OK great, resolving this for now but yeah, let's add this to the docs and see how it serves us next time. Thanks and sorry for the delay. Adding Traffic to the tags as Nat has done now should prevent this from taking a week next time! :)

Wed, Sep 24, 3:58 PM · Traffic, SRE
ssingh added a comment to T404974: [Search Console Verification DNS Request] - {{wikimediafoundation.org}}.

Understood, thank you! Going forward, as part of the process for handling these requests, once the verification is completed for the domain and ITS can see the domain property in GSC, should we (ITS) go ahead and provide the TXT record in the phab ticket so that oktaservice@ can get domain ownership as well in GSC? I'm writing process docs for ITS on this type of request.

Wed, Sep 24, 3:57 PM · Traffic, SRE
ssingh added a comment to T404974: [Search Console Verification DNS Request] - {{wikimediafoundation.org}}.

Yes, I am using the ITS service account designated to grant others access to our various GSC properties. The account is oktaservice@, which Nat gave account owner permissions a few weeks ago.

Wed, Sep 24, 3:53 PM · Traffic, SRE
ssingh added a comment to T404974: [Search Console Verification DNS Request] - {{wikimediafoundation.org}}.

No problem. Here it is: google-site-verification=m7jEgoI4DOUy0u6cebxtp7oJT7s3nnNyPWgmPQmNEjc

Wed, Sep 24, 3:45 PM · Traffic, SRE
ssingh added a comment to T404974: [Search Console Verification DNS Request] - {{wikimediafoundation.org}}.

Hi @ssingh , I just tested this with our service account (I am part of ITS), and I am still seeing the window prompting me to verify domain ownership for wikimediafoundation.org (copy the TXT record, upload it to the DNS config for the domain). When I click Verify, it gives me an error still, saying "ownership verification failed." I assume this is because it may take a few hours for the changes you made to propagate, but I just wanted to check if this is expected behavior? Thank you again!

Wed, Sep 24, 3:42 PM · Traffic, SRE