process_recording_tagging_queue. The fix (§5.2): make the queue's two stateless filters race-resilient — drop the enqueue-time subscription_type check, replace the dequeue-time drop with re-queue-with-backoff. One mechanism, no sidecar. Empirical S3 probe (40 sampled sessions) confirmed ~95% of leaks still have audio in S3, so the fix also unblocks the S3 tag-for-deletion path that the queue race was bypassing.
A leaked session is a profiles_quransession row where:
has_audio = TRUEsession_type = 'RECITE' and is_deleted = FALSEUser.objects.with_premium_entitlement() (mirrored faithfully from
core-api@78e1344a per TA-4632, including grace and loyal-subscriber
exemptions)Restricted to sessions created on or after 2026-05-05 04:19 UTC — TA-4632's S3-inventory cutoff — so we measure only the live leak rate, not the historical residue (343 M sessions, separately addressed by TA-4670 Workstream 1).
Query (ClickHouse, executed against tarteeldb.public_*):
saved at /tmp/leak_validation_main.sql (embedded in §7). Entitlement CTE mirrors
basirah-lab:jobs/orphan_audio_audit/identify_users.sql verbatim except
that students_pg (PG-only export) is omitted — at worst slightly
overcounts likely_leaked by the count of active students (≪ 1% of users).
| Lapse day | Persistent race-failures | Note |
|---|---|---|
| 2026-05-26 (today) | 1,622 | Transient — trial ending today, will be cleaned by next 01:30 UTC cron run. |
| 2026-05-25 | 11 | Persistent |
| 2026-05-24 | 9 | Persistent |
| 2026-05-23 | 10 | Persistent |
| 2026-05-22 | 7 | Persistent |
| 2026-05-21 | 13 | Persistent |
| 2026-05-20 | 13 | Persistent |
| 2026-05-19 | 16 | Persistent (10 trial-driven, 6 sub-driven) |
| 5-18 → 5-06 | flat ~5–13/day | Persistent |
Per-week math: ~7.8 persistent race-failures/day × 7 days ≈ ~55 users/week × ~8–10 sessions/user ≈ ~450–550 sessions/week currently leaking past the queue's filters. With the Option C fix (§5.2), this drops to ~0/week steady-state (modulo the DLQ tail of persistently-flapping accounts).
6,175 unique users were counted leaked at the 2026-05-25 snapshot. The entitlement-history breakdown below is informative: ~all leaked users have a free-trial history, ~none lack any entitlement signal entirely — confirming the gate works at session-create and the bug is in the post-trial path.
| Category | Users | Share | Interpretation |
|---|---|---|---|
| Ever had a free trial (any time) | 6,162 | 99.8% | ~all leaked users have a trial somewhere in their history. Note: trials are nearly universal in Tarteel (auto-grant on first login), so this is closer to "active user" than "trial lapser". |
| Ever had a subscription | 108 | 1.7% | Tiny minority. Mostly old churners (6,073 subs expired > 90d ago across all leaked users — overlaps with "ever had a subscription" weakly). |
| · Subscription expired in last 30d | 102 | 1.7% | Within our measurement window — these are recent paid-tier lapsers. |
| · Subscription expired 30–90d ago | 0 | 0% | Suspiciously empty — likely a clustering artifact in the cohort. |
| · Subscription expired > 90d ago | 6,073 | 98.3%* | *Likely overlap with the "ever had trial" group: same user with both signals counted in both rows. |
| Ever had family-plan membership | 4 | 0.06% | Negligible — not a meaningful lapse path here. |
| Ever had alim beneficiary | 133 | 2.2% | Small. Likely access-revoked cases. |
| Never had ANY entitlement signal | 2 | ~0% | Almost zero. Means the PR #1564 entitlement gate is doing its job at session-create time — <1 per ten thousand sessions bypass it. |
has_audio=TRUE sessions for now-non-entitled accounts.
PR #1564's session-create gate is holding (only 2 users with zero entitlement signal). The bug is in the
post-trial path — TA-4634's exact diagnosis: trial expiry races / lost events in
profiles/tasks/user_management.py:92 failing to flip has_audio.
TA-4634 traced this to two compounding code-path facts in core-api:
has_audio at session create on entitlement
(if not user.should_store_audio: has_audio = False). The gate is holding: today, only ~2 of 6,175 leaked users had no entitlement signal ever.process_recording_tagging_queue is event-driven — it consumes a Redis queue of downgrade events. The lapse paths that don't enqueue an event leave the flag stuck at TRUE:
profiles/tasks/user_management.py:92) — enqueues with eligible_at = now(), but if anything filters the user out (e.g. they re-subscribed between trial-end and queue run, then unsubscribed), the flag isn't flipped. This is the dominant lapse path in our data.EXPIRED transitions that bypass post_subscription_downgrade_events (webhook dedup edge cases).process_recording_tagging_queue at all.
None of the 4 commits to core-api/main since TA-4634 was filed (2026-05-06) touch
profiles/tasks/user_data.py,
profiles/tasks/user_management.py, or the
_delete_audio_recordings path. Verified by:
git log --since=2026-05-06 --oneline origin/main -- \ profiles/tasks/user_data.py profiles/tasks/user_management.py \ profiles/models/users.py billing/ profiles/views/webhooks.py
Adjacent activity worth noting: Yūnus merged #1622 today ("Add event tracking for expired event") which instruments the billing expiration task but doesn't patch the lapse-pipeline gap. He may be staging diagnostic work in this area.
Reading the chart in §2.1 directly:
process_recording_tagging_queue dropped them due to the race TA-4634 identified. They stay leaked indefinitely until the queue is fixed (§5.2) — at which point the existing queue logic itself will reprocess any backlog naturally.process_recording_tagging_queue run. Most will clear normally; ~0.5% will fall into the race and join the persistent backlog tomorrow.The leak isn't a missing piece of plumbing — it's a race in the existing event-driven cleanup. Two stateless filters check "is the user currently entitled?" at different moments and silently drop the work if the answer is yes.
flowchart TD
T([Trial ends at time T])
T --> H["Hourly: downgrade_expired_trials"]
H --> H1{"user.subscription_type
== DEFAULT?"}
H1 -->|Yes| Q["queue_recording_tagging.delay()"]
H1 -->|"No (user briefly resubscribed
between T and hourly tick)"| X1[/"❌ DROPPED at enqueue
event never created"/]
Q --> P["Daily 01:30 UTC:
process_recording_tagging_queue"]
P --> P1{"user has
premium_entitlement?"}
P1 -->|No| Run["✅ tag S3 + flip has_audio"]
P1 -->|"Yes (user resubscribed
between enqueue and 01:30)"| X2[/"❌ DROPPED at dequeue
item removed from PROCESSING"/]
X1 --> Leak["Leaked:
has_audio TRUE,
S3 file remains,
user non-entitled"]
X2 --> Leak
Leak -.->|user later unsubscribes| NoEvt["No re-fire — past event is gone"]
NoEvt -.-> Leak
classDef race fill:#fce4d6,stroke:#c0392b,color:#c0392b
classDef leak fill:#fdf0e8,stroke:#c0392b,color:#c0392b
classDef ok fill:#d6efdb,stroke:#2c7a3e,color:#2c7a3e
class X1,X2 race
class Leak,NoEvt leak
class Run ok
Two distinct races:
downgrade_expired_trials only enqueues for users whose subscription_type=DEFAULT at the moment the hourly task runs. If the user re-subscribed (even briefly) between trial-end and the next hourly tick, no event is ever created. Cancelling later doesn't re-trigger the trial-end — the hourly window has moved on.process_recording_tagging_queue drops users who look entitled at 01:30 UTC. The item is removed from the PROCESSING list with no retry. If they cancel at 01:45, nothing re-fires.Legend: existing modified (behavior changes) new (added by this PR)
flowchart TD
T2([Trial ends at time T])
T2 --> H2["Hourly: downgrade_expired_trials
~ drop subscription_type filter ~"]
H2 --> Q2["queue_recording_tagging.delay()
called for every trial-end"]
Q2 --> P2["Daily 01:30 UTC:
process_recording_tagging_queue"]
P2 --> P2D{"user has
premium_entitlement?"}
P2D -->|No| Run2["tag S3 + flip has_audio
(_upload_manifest_and_start_batch_job
+ _delete_audio_recordings)"]
P2D -->|"Yes (entitled now)"| RQ["Re-queue with
attempts += 1
eligible_at = now + 24h"]
RQ --> Att{"attempts <
RECORDING_TAGGING_MAX_ATTEMPTS?"}
Att -->|Yes| P2
Att -->|No| DLQ["RECORDING_TAGGING_DLQ
for manual review"]
classDef existing fill:#f3f1ec,stroke:#c8c4b8,color:#1a1a1a
classDef changed fill:#fdf0e8,stroke:#b8651a,color:#b8651a,stroke-width:2px
classDef new fill:#d6efdb,stroke:#2c7a3e,color:#2c7a3e,stroke-width:2px
class T2,P2,P2D,Run2,DLQ existing
class H2,Q2 changed
class RQ,Att new
Nodes touched: 2 modified (drop one filter; the .delay() call now fires for every trial-end, not only DEFAULT users), 2 new (re-queue with backoff; attempts check). The DLQ destination already exists (RECORDING_TAGGING_DLQ Redis list); we just route to it on attempt exhaustion. The "tag S3 + flip has_audio" branch and the daily task are unchanged.
Two surgical changes, no new tasks, no new architecture:
| File | Change | Why |
|---|---|---|
profiles/tasks/user_management.py |
Drop the user__subscription_type=DEFAULT filter from the trial-end query. |
Always enqueue trial-ends. Brief re-subscribes no longer cause silent drops. |
profiles/tasks/user_data.py |
Replace "drop entitled-now users" with "re-queue with attempts+=1, eligible_at = now + 24h". DLQ after RECORDING_TAGGING_MAX_ATTEMPTS = 14 retries. |
Users who flap between entitled/non-entitled get re-evaluated on each cycle until their state stabilizes or the DLQ catches them. |
profiles/constants.py |
Add RECORDING_TAGGING_MAX_ATTEMPTS = 14. |
Cap retries at ~2 weeks. Beyond that, a human looks at it. |
profiles/tests/test_audio_grace_period.py |
Add race-scenario tests: enqueue while temporarily entitled; re-queue + DLQ paths. | Lock down the fix. |
The earlier safety-net approach (Option B — periodic reconciliation in a separate task) was implemented as TA-4917 / core-api#1624 and is open in draft. After review feedback that the reconciliation pattern is a band-aid over the actual race, the recommended path forward is Option C: close the draft, file a new ticket and PR with the surgical fix above.
_upload_manifest_and_start_batch_job step in process_recording_tagging_queue tags S3 files for deletion as it was always supposed to. The empirical S3 probe (40 sampled leaked sessions, ~95% still had audio in S3) confirmed the safety-net's flag-only approach would have left actual audio uncleaned — Option C avoids that gap by fixing the path that owns both flag-flip and S3-tag.
All queries were executed against ClickHouse Cloud (v12748f444.us-west-2.aws.clickhouse.cloud)
with the read-only khaled_ro credentials per the local clickhouse skill.
Saved alongside this report.
-- Leak validation: is the live-pipeline gap still active 20 days post-audit?
-- Mirrors identify_users.sql entitlement logic (omits students_pg).
-- See: TA-4634 (Done 2026-05-06), basirah-lab/jobs/orphan_audio_audit/identify_users.sql
WITH
now64() AS run_now,
toDateTime64('2026-05-05 04:19:00', 6, 'UTC') AS audit_cutoff,
entitled_users AS (
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0 AND status IN ('ACTIVE', 'GRACE', 'CANCELLED')
UNION DISTINCT
SELECT id AS user_id FROM tarteeldb.public_profiles_user FINAL
WHERE _peerdb_is_deleted = 0 AND subscription_type = 'PREMIUM'
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_profiles_freetrial FINAL
WHERE _peerdb_is_deleted = 0 AND started_at <= run_now AND ended_at > run_now
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_profiles_familymember FINAL
WHERE _peerdb_is_deleted = 0
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_alim_beneficiary FINAL
WHERE _peerdb_is_deleted = 0 AND access_revoked >= run_now
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0 AND start_date <= run_now - INTERVAL 2 YEAR
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0 AND status = 'EXPIRED' AND expired_at > run_now - INTERVAL 7 DAY
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscriptionevent FINAL
WHERE _peerdb_is_deleted = 0
AND type IN ('EXPIRED', 'CANCELLED')
AND created_at > run_now - INTERVAL 7 DAY
)
SELECT
toDate(created_at) AS day,
count() AS new_has_audio_sessions,
countIf(user_id NOT IN (SELECT user_id FROM entitled_users)) AS likely_leaked,
round(
countIf(user_id NOT IN (SELECT user_id FROM entitled_users))
/ nullIf(count(), 0) * 100,
1) AS leak_pct
FROM tarteeldb.public_profiles_quransession FINAL
WHERE _peerdb_is_deleted = 0
AND created_at >= audit_cutoff
AND has_audio = TRUE
AND session_type = 'RECITE'
AND is_deleted = FALSE
GROUP BY day
ORDER BY day DESC
FORMAT TabSeparatedWithNames
-- Composition of the leak: per "lapse path", how many leaked USERS have that history?
-- For each leaked user (currently non-entitled, has_audio=TRUE post-cutoff session),
-- classify what their last-known entitlement signal was.
WITH
now64() AS run_now,
toDateTime64('2026-05-05 04:19:00', 6, 'UTC') AS audit_cutoff,
entitled_users AS (
-- (same 8-branch UNION as Q1 — see above)
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0 AND status IN ('ACTIVE', 'GRACE', 'CANCELLED')
UNION DISTINCT
SELECT id AS user_id FROM tarteeldb.public_profiles_user FINAL
WHERE _peerdb_is_deleted = 0 AND subscription_type = 'PREMIUM'
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_profiles_freetrial FINAL
WHERE _peerdb_is_deleted = 0 AND started_at <= run_now AND ended_at > run_now
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_profiles_familymember FINAL
WHERE _peerdb_is_deleted = 0
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_alim_beneficiary FINAL
WHERE _peerdb_is_deleted = 0 AND access_revoked >= run_now
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0 AND start_date <= run_now - INTERVAL 2 YEAR
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0 AND status = 'EXPIRED' AND expired_at > run_now - INTERVAL 7 DAY
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscriptionevent FINAL
WHERE _peerdb_is_deleted = 0
AND type IN ('EXPIRED', 'CANCELLED')
AND created_at > run_now - INTERVAL 7 DAY
),
leaked_users AS (
SELECT DISTINCT user_id
FROM tarteeldb.public_profiles_quransession FINAL
WHERE _peerdb_is_deleted = 0
AND created_at >= audit_cutoff
AND has_audio = TRUE
AND session_type = 'RECITE'
AND is_deleted = FALSE
AND user_id NOT IN (SELECT user_id FROM entitled_users)
),
-- Per leaked user: do they have a record in each source?
sub_history AS (
SELECT DISTINCT user_id, max(expired_at) AS last_expired
FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0
AND user_id IN (SELECT user_id FROM leaked_users)
GROUP BY user_id
),
trial_history AS (
SELECT DISTINCT user_id, max(ended_at) AS last_ended
FROM tarteeldb.public_profiles_freetrial FINAL
WHERE _peerdb_is_deleted = 0
AND user_id IN (SELECT user_id FROM leaked_users)
GROUP BY user_id
),
family_history AS (
SELECT DISTINCT user_id
FROM tarteeldb.public_profiles_familymember FINAL
WHERE user_id IN (SELECT user_id FROM leaked_users)
),
alim_history AS (
SELECT DISTINCT user_id
FROM tarteeldb.public_alim_beneficiary FINAL
WHERE user_id IN (SELECT user_id FROM leaked_users)
)
SELECT
countDistinct(l.user_id) AS total_leaked_users,
countDistinctIf(l.user_id, sh.user_id != 0) AS ever_had_subscription,
countDistinctIf(l.user_id, sh.last_expired > run_now - INTERVAL 30 DAY) AS sub_expired_last_30d,
countDistinctIf(l.user_id, sh.last_expired BETWEEN run_now - INTERVAL 90 DAY AND run_now - INTERVAL 30 DAY) AS sub_expired_30d_90d,
countDistinctIf(l.user_id, sh.last_expired < run_now - INTERVAL 90 DAY) AS sub_expired_over_90d,
countDistinctIf(l.user_id, th.user_id != 0) AS ever_had_trial,
countDistinctIf(l.user_id, fh.user_id != 0) AS ever_had_family,
countDistinctIf(l.user_id, ah.user_id != 0) AS ever_had_alim,
countDistinctIf(l.user_id,
sh.user_id = 0 AND th.user_id = 0 AND fh.user_id = 0 AND ah.user_id = 0
) AS never_had_any_entitlement
FROM leaked_users l
LEFT JOIN sub_history sh ON sh.user_id = l.user_id
LEFT JOIN trial_history th ON th.user_id = l.user_id
LEFT JOIN family_history fh ON fh.user_id = l.user_id
LEFT JOIN alim_history ah ON ah.user_id = l.user_id
FORMAT Vertical
-- Hypothesis: the 2026-05-18 leak cliff is driven by a batch of subscription
-- EXPIRED events around 2026-05-11 (7 days earlier — grace period rolls off).
-- PR #1621 (Yūnus, 2026-05-25) fixed "Stripe renewal transactions being silently
-- dropped" + ships scripts/backfill_stripe_renewals.py. If that's the cause,
-- we should see an anomalous spike of EXPIRED events around 2026-05-11.
WITH
now64() AS run_now
SELECT
toDate(expired_at) AS expired_day,
count() AS expired_subscriptions,
-- Compare to active count for context
countIf(status = 'EXPIRED') AS now_status_expired,
countIf(status = 'ACTIVE') AS now_status_active,
countIf(status = 'GRACE') AS now_status_grace,
countIf(status = 'CANCELLED') AS now_status_cancelled
FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0
AND expired_at >= run_now - INTERVAL 30 DAY
AND expired_at <= run_now
GROUP BY expired_day
ORDER BY expired_day DESC
FORMAT TabSeparatedWithNames
-- Free trials ending by day for last 30 days. Looking for cohort spike that
-- explains the 2026-05-18 leak cliff. Trials have no grace period per the audit
-- (profiles/tasks/user_management.py:92), so trial-end → immediate "non-entitled".
WITH
now64() AS run_now
SELECT
toDate(ended_at) AS trial_ended_day,
count() AS trials_ending,
-- For these trials, were they actually consumed (started_at < ended_at)?
countIf(started_at < ended_at) AS started_before_end,
-- Average trial length in days
round(avg(dateDiff('day', started_at, ended_at)), 1) AS avg_trial_days
FROM tarteeldb.public_profiles_freetrial FINAL
WHERE _peerdb_is_deleted = 0
AND ended_at >= run_now - INTERVAL 30 DAY
AND ended_at <= run_now
GROUP BY trial_ended_day
ORDER BY trial_ended_day DESC
FORMAT TabSeparatedWithNames
WITH
now64() AS run_now,
toDateTime64('2026-05-05 04:19:00', 6, 'UTC') AS audit_cutoff,
entitled_users AS (
-- (same 8-branch UNION as Q1 — see above)
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0 AND status IN ('ACTIVE', 'GRACE', 'CANCELLED')
UNION DISTINCT
SELECT id AS user_id FROM tarteeldb.public_profiles_user FINAL
WHERE _peerdb_is_deleted = 0 AND subscription_type = 'PREMIUM'
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_profiles_freetrial FINAL
WHERE _peerdb_is_deleted = 0 AND started_at <= run_now AND ended_at > run_now
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_profiles_familymember FINAL
WHERE _peerdb_is_deleted = 0
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_alim_beneficiary FINAL
WHERE _peerdb_is_deleted = 0 AND access_revoked >= run_now
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0 AND start_date <= run_now - INTERVAL 2 YEAR
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscription FINAL
WHERE _peerdb_is_deleted = 0 AND status = 'EXPIRED' AND expired_at > run_now - INTERVAL 7 DAY
UNION DISTINCT
SELECT DISTINCT user_id FROM tarteeldb.public_billing_subscriptionevent FINAL
WHERE _peerdb_is_deleted = 0
AND type IN ('EXPIRED', 'CANCELLED')
AND created_at > run_now - INTERVAL 7 DAY
)
SELECT
toDate(created_at) AS day,
count() AS leaked_sessions,
countDistinct(user_id) AS leaked_users,
round(count() / nullIf(countDistinct(user_id), 0), 2) AS avg_sessions_per_user
FROM tarteeldb.public_profiles_quransession FINAL
WHERE _peerdb_is_deleted = 0
AND created_at >= audit_cutoff
AND has_audio = TRUE
AND session_type = 'RECITE'
AND is_deleted = FALSE
AND user_id NOT IN (SELECT user_id FROM entitled_users)
GROUP BY day
ORDER BY day DESC
FORMAT TabSeparatedWithNames
/tmp/leak_validation_main.sql, /tmp/leak_composition.sql, /tmp/leak_expiration_pattern.sql, /tmp/leak_trial_pattern.sql, /tmp/leak_concentration.sql, /tmp/leak_rca_correlation.sql, /tmp/leak_rca_once_per_user.sql, /tmp/leak_rca_q3_fast.sqlbasirah-lab:jobs/orphan_audio_audit/identify_users.sql on branch feat/ta-4632-orphan-recordings-manifest.