Initial commit: Twitter-Telegram Profile Matching System

This module provides comprehensive Twitter-to-Telegram profile matching
and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V6 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Andrew Jiang
Date: 2025-11-04 22:56:25 -08:00
Commit: 5319d4d868
10 changed files with 3394 additions and 0 deletions

.gitignore (vendored, new file, 47 lines)

@@ -0,0 +1,47 @@
# Environment variables
.env
.env.local
.env.*.local
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
build/
dist/
*.egg-info/
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# Logs
*.log
*.out
*.pid
# Database
*.db
*.sqlite
*.sqlite3
# Checkpoints and temp files
*_checkpoint.json
*.tmp
*.temp
# OS
.DS_Store
Thumbs.db
# Test outputs
test_output/
*.test

CHANGELOG.md (new file, 159 lines)

@@ -0,0 +1,159 @@
# ProfileMatching Changelog
## 2025-11-04 - Initial Module Creation & URL Resolution Integration
### Module Organization
- Created dedicated `ProfileMatching/` folder within UnifiedContacts
- Bundled all matching and verification scripts together
- Added comprehensive interactive `main.py` with menu system
- Added detailed `README.md` documentation
### Major Enhancement: URL Resolution Integration
**Problem**: Matches were missed when Twitter and Telegram usernames differ, even though the Twitter profile explicitly links to the Telegram account.
**Solution**: Integrated `url_resolution_queue` table data into candidate finding.
**Implementation**:
- Added new method `find_by_resolved_url()` to TwitterMatcher class (find_twitter_candidates.py:339-378)
- Queries Twitter profiles where bio URLs (t.co/xyz) resolve to t.me/{username}
- Integrated as Method 5b in candidate finding pipeline (find_twitter_candidates.py:554-574)
- Baseline confidence: 0.95 (very high - explicit link in bio)
**Impact**:
- 140+ potential new matches identified
- Captures matches like Twitter @Block_Flash → Telegram @bull_flash
- Especially valuable when usernames differ but user explicitly links profiles
**Example Match**:
```
Twitter: @Block_Flash (ID: 63429948)
Telegram: @bull_flash
Method: twitter_bio_url_resolves_to_telegram
Confidence: 0.95
Resolved URL: https://t.me/bull_flash
Original URL: https://t.co/dc3iztSG9B
```
### URL Resolution Data Source
The `url_resolution_queue` table contains:
- 16,133 Twitter profiles with resolved Telegram URLs
- Shortened URLs from Twitter bios (t.co/xyz)
- Resolved destinations (full URLs)
- Extracted Telegram handles (JSONB array)
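
The JSONB containment lookup behind this data source can be sketched as follows. The SQL mirrors the join described above (`u.id::text = urq.twitter_id` handles the VARCHAR-to-BIGINT cast), while the helper name is illustrative rather than the module's actual API:

```python
import json

# Match Twitter rows whose extracted telegram_handles JSONB array
# contains the given handle (the @> operator tests containment).
RESOLVED_URL_SQL = """
    SELECT u.id, u.username, urq.original_url, urq.resolved_url
    FROM url_resolution_queue urq
    JOIN public.users u ON u.id::text = urq.twitter_id
    WHERE urq.telegram_handles @> %s::jsonb
    LIMIT 3
"""

def containment_param(telegram_username: str) -> str:
    """Build the JSONB containment parameter: a one-element lowercase array."""
    return json.dumps([telegram_username.lower()])

# With a live cursor: cur.execute(RESOLVED_URL_SQL, (containment_param("bull_flash"),))
print(containment_param("Bull_Flash"))  # ["bull_flash"]
```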
### Files Modified
1. **find_twitter_candidates.py**
- Added `find_by_resolved_url()` method
- Integrated into `find_candidates_for_contact()` as Method 5b
- Fixed type casting for twitter_id (VARCHAR) to user.id (BIGINT) join
2. **main.py** (ProfileMatching module)
- Created comprehensive interactive menu
- Real-time statistics display
- Streamlined workflow for candidate finding and LLM verification
3. **README.md**
- Complete documentation of all 10 matching methods
- Usage examples and performance metrics
- Configuration and troubleshooting guides
4. **UnifiedContacts/main.py**
- Added option 15: "Open Profile Matching System (Interactive Menu)"
- Reorganized menu to separate ProfileMatching from individual Twitter steps
### Current Statistics (as of 2025-11-04)
**Candidates**:
- Users with candidates: 38,121
- Total candidates found: 253,117
- Processed by LLM: 253,085
- Pending verification: 32
**Verified Matches**:
- Users with matches: 25,662
- Total matches: 36,147
- Average confidence: 0.74
- High confidence (90%+): 12,031
- Medium confidence (80-89%): 1,452
- Low confidence (70-79%): 7,505
### LLM Verification (V6 Prompt)
Current prompt improvements:
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy (Very Strong, Strong Supporting, Weak, Red Flags)
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer
- Emphasis on evaluating ALL candidates together
### Performance Metrics
**Candidate Finding (Threaded)**:
- Speed: ~1.5 contacts/sec
- Time for 43K contacts: ~16-18 hours
- Workers: 8 (default)
**LLM Verification (Async)**:
- Speed: ~32 users/minute (100 concurrent requests)
- Cost: ~$0.003 per user (GPT-5-mini)
- Time for 43K users: ~23 hours
### Module Structure
```
ProfileMatching/
├── main.py # Interactive menu system
├── README.md # Complete documentation
├── CHANGELOG.md # This file
├── find_twitter_candidates.py # Core matching logic (10 methods)
├── find_twitter_candidates_threaded.py # Threaded implementation
├── verify_twitter_matches_v2.py # LLM verification (V6 prompt)
├── review_match_quality.py # Analysis tools
└── setup_twitter_matching_schema.sql # Database schema
```
### 10 Matching Methods Summary
1. **Phash Match** (0.95/0.88) - Profile picture similarity
2. **Exact Bio Handle** (0.95) - Twitter handle extracted from Telegram bio
3. **Bio URL Resolution** (0.95) ⭐ NEW - Shortened URL resolves to Telegram
4. **Twitter Bio Has Telegram** (0.92) - Twitter bio mentions Telegram username
5. **Display Name Containment** (0.92) - TG name in TW name
6. **Exact Username** (0.90) - Usernames match exactly
7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85) - Trigram similarity
10. **Username Variation** (0.80) - Generated username variations
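When several of these methods surface the same Twitter profile, the pipeline keeps one candidate per profile ID. A minimal sketch of that deduplication (the helper name and the keep-highest-confidence policy are illustrative; the real matcher tracks seen IDs as it runs the methods in order):

```python
from typing import Dict, List

def dedupe_candidates(candidates: List[Dict]) -> List[Dict]:
    """Keep the highest-baseline-confidence candidate per Twitter profile ID."""
    best: Dict[str, Dict] = {}
    for cand in candidates:
        twitter_id = cand["twitter_profile"]["id"]
        if (twitter_id not in best
                or cand["baseline_confidence"] > best[twitter_id]["baseline_confidence"]):
            best[twitter_id] = cand
    # Strongest signals first, mirroring the method hierarchy above
    return sorted(best.values(), key=lambda c: -c["baseline_confidence"])

cands = [
    {"twitter_profile": {"id": "1"}, "match_method": "fuzzy_name", "baseline_confidence": 0.75},
    {"twitter_profile": {"id": "1"}, "match_method": "exact_username", "baseline_confidence": 0.90},
    {"twitter_profile": {"id": "2"}, "match_method": "phash_match", "baseline_confidence": 0.95},
]
print([c["match_method"] for c in dedupe_candidates(cands)])  # ['phash_match', 'exact_username']
```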
### Testing
All changes tested with:
- Standalone method testing (find_by_resolved_url)
- Full integration testing (find_candidates_for_contact)
- Verified deduplication works correctly
- Confirmed matches with different usernames are captured
### Next Steps
Potential future enhancements:
- Add more matching methods (location, bio keywords, mutual connections)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats
- Additional URL resolution sources beyond Twitter bios
### Migration Notes
**For existing deployments**:
1. No database schema changes required
2. Existing `url_resolution_queue` table is used as-is
3. Scripts in `scripts/` folder remain unchanged and functional
4. New ProfileMatching module is additive, doesn't break existing workflows
**To use new features**:
1. Use ProfileMatching/main.py instead of individual scripts
2. Or run scripts directly from ProfileMatching folder
3. Or update import paths to use ProfileMatching module

QUICKSTART.md (new file, 218 lines)

@@ -0,0 +1,218 @@
# ProfileMatching Quick Start Guide
## 🚀 Fastest Way to Get Started
### Option 1: Interactive Menu (RECOMMENDED)
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```
This gives you an interactive menu with:
- Real-time statistics
- Guided workflow
- Easy access to all features
### Option 2: Launch from Main UnifiedContacts Menu
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts
python3.10 main.py
# Select option 15: "Open Profile Matching System"
```
## 📋 Typical Workflow
### Step 1: Find Candidates (First Time)
If you haven't found candidates yet, run:
```bash
# From ProfileMatching folder
python3.10 find_twitter_candidates_threaded.py --workers 8
# Or use interactive menu: Option 1
```
**Expected time**: ~16-18 hours for all 43K contacts
### Step 2: Verify with LLM
After candidates are found, verify them:
```bash
# From ProfileMatching folder
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
# Or use interactive menu: Option 3
```
**Expected time**: ~23 hours for all users
**Cost**: ~$130 for all users (GPT-5-mini at $0.003/user)
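The estimates above are straightforward products of the per-user rates:

```python
users = 43_000
cost_per_user = 0.003   # GPT-5-mini, per the estimate above
users_per_minute = 32   # at 100 concurrent requests

print(f"${users * cost_per_user:.0f}")               # $129
print(f"{users / users_per_minute / 60:.1f} hours")  # 22.4 hours
```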
### Step 3: Review Results
```bash
# From ProfileMatching folder
python3.10 review_match_quality.py
# Or use interactive menu: Option 5
```
## 🧪 Test Mode (Recommended Before Full Run)
Always test with a small batch first:
```bash
# Test with 50 users
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10
# Or use interactive menu: Option 4
```
This helps you:
- Verify the system is working correctly
- Check match quality before spending on full run
- Estimate costs and timing
## 📊 Check Current Status
At any time, you can check current progress:
```bash
# Launch interactive menu and select Option 6: "Show statistics only"
python3.10 main.py
# Press 6, then 0 to exit
```
Or query directly:
```bash
psql -d telegram_contacts -U andrewjiang -c "
SELECT
COUNT(DISTINCT telegram_user_id) as users_with_candidates,
COUNT(*) as total_candidates,
COUNT(*) FILTER (WHERE llm_processed = TRUE) as processed,
COUNT(*) FILTER (WHERE llm_processed = FALSE) as pending
FROM twitter_match_candidates;
"
```
## 🔄 Re-Running After Updates
If you've updated the LLM prompt or matching logic:
### Re-find Candidates (if matching logic changed)
```bash
# Delete old candidates
psql -d telegram_contacts -U andrewjiang -c "TRUNCATE twitter_match_candidates CASCADE;"
# Re-run candidate finding
python3.10 find_twitter_candidates_threaded.py --workers 8
```
### Re-verify with New Prompt (if only prompt changed)
```bash
# Reset LLM processing flag
psql -d telegram_contacts -U andrewjiang -c "UPDATE twitter_match_candidates SET llm_processed = FALSE;"
# Delete old matches
psql -d telegram_contacts -U andrewjiang -c "TRUNCATE twitter_telegram_matches;"
# Re-run verification
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```
## 🎯 Most Common Commands
### Find candidates for first 1000 contacts (testing)
```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```
### Verify matches for pending candidates
```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```
### Check match quality distribution
```bash
python3.10 review_match_quality.py
```
### Export matches to CSV (coming soon)
```bash
# Will be added in future update
```
## 💡 Pro Tips
1. **Always use threaded candidate finding** - It's 10-20x faster
2. **Use high concurrency for verification** - 100-200 concurrent requests for optimal speed
3. **Test first** - Always run with `--test --limit 50` before full runs
4. **Monitor costs** - Check OpenAI dashboard during verification
5. **Check the stats** - Use Option 6 in interactive menu to monitor progress
## 🐛 Troubleshooting
### "No candidates found"
- Check if Twitter database has data: `psql -d twitter_data -c "SELECT COUNT(*) FROM users;"`
- Check Telegram contacts: `psql -d telegram_contacts -c "SELECT COUNT(*) FROM contacts;"`
### "LLM verification is slow"
- Increase `--concurrent` parameter (try 150-200)
- Check OpenAI rate limits in dashboard
- Verify network connection
### "Too many low-quality matches"
- Review the V6 prompt in `verify_twitter_matches_v2.py`
- Run `review_match_quality.py` to analyze
- Consider adjusting confidence thresholds
### "Missing obvious matches"
- Check if candidate was found:
```sql
SELECT * FROM twitter_match_candidates WHERE telegram_user_id = YOUR_USER_ID;
```
- If found but not verified, check `llm_verdict` field for reasoning
- If not found at all, may need new matching method
## 📚 More Information
- See `README.md` for complete documentation
- See `CHANGELOG.md` for recent updates
- See individual script files for command-line options
## 🆘 Need Help?
Common issues and solutions:
| Issue | Solution |
|-------|----------|
| Import errors | Make sure you're using python3.10 |
| Database connection errors | Check PostgreSQL is running: `pg_isready` |
| OpenAI API errors | Verify API key in `.env` file |
| Out of memory | Reduce concurrent requests or use batching |
## 🎓 Understanding the Output
### Candidate Finding Output
```
Processing contact 1000/43000 (2.3%)
Found 6 candidates for @username
• exact_username: @username (0.90)
• fuzzy_name: @similar_name (0.75)
```
### LLM Verification Output
```
[Progress] 500/1000 users (50.0%) | 125 matches | ~$1.50 | 25.0 users/min
```
### Match Quality Review
```
Total users with matches: 25,662
Total matches: 36,147
Average confidence: 0.74
Confidence Distribution:
90%+: 12,031 matches (HIGH)
80-89%: 1,452 matches (MEDIUM)
70-79%: 7,505 matches (LOW)
```

README.md (new file, 207 lines)

@@ -0,0 +1,207 @@
# Twitter-Telegram Profile Matching System
A comprehensive system for finding and verifying Twitter-Telegram profile matches using multiple matching methods and LLM-based verification.
## Overview
This system operates in two main steps:
1. **Candidate Finding**: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods
2. **LLM Verification**: Uses GPT to evaluate candidates and assign confidence scores (0.70-1.0)
## Quick Start
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```
## Matching Methods
The system uses 10 different methods to find Twitter candidates:
### High Confidence Methods (0.90-0.95)
1. **Phash Match** (0.95 for exact, 0.88 for distance=1)
- Compares profile picture hashes
- Pre-computed in `telegram_twitter_phash_matches` table
2. **Exact Bio Handle** (0.95)
- Extracts Twitter handles from Telegram bio
- Patterns: `@username`, `twitter.com/username`, `x.com/username`
3. **Bio URL Resolution** (0.95) ⭐ NEW
- Twitter bio contains shortened URL (t.co/xyz) that resolves to `t.me/username`
- Queries `url_resolution_queue` table
- Captures matches even when usernames differ
4. **Twitter Bio Has Telegram** (0.92)
- Reverse lookup: Twitter bio mentions Telegram username
- Searches for `@username`, `t.me/username`, `telegram.me/username`
5. **Display Name Containment** (0.92)
- Telegram name contained within Twitter display name
6. **Exact Username** (0.90)
- Telegram username exactly matches Twitter username
### Medium Confidence Methods (0.80-0.88)
7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85)
- PostgreSQL trigram similarity with 0.65 threshold
10. **Username Variation** (0.80)
- Generates variations (remove underscores, flip numbers, etc.)
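Method 9's trigram similarity comes from PostgreSQL's `pg_trgm` extension. A simplified pure-Python model (approximating pg_trgm's word padding, not the extension itself) shows how the 0.65 threshold behaves:

```python
def trigrams(text: str) -> set:
    """Approximate pg_trgm trigram extraction: lowercase, pad each word."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "  # pg_trgm pads with 2 leading, 1 trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    """Jaccard similarity over trigram sets, like pg_trgm's similarity()."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(similarity("alice", "alice"))          # 1.0
print(similarity("alice", "alicia") > 0.3)   # True
```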
## LLM Verification
The system uses GPT-5-mini with a sophisticated V6 prompt that:
- Evaluates ALL candidates together (comparative evaluation)
- Applies differential scoring (only one can be "most likely")
- Distinguishes between personal and company accounts
- Considers signal strength holistically
- Only saves matches with 70%+ confidence
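The verification loop can be sketched with a stubbed LLM call. Everything here is illustrative: the real script batches up to 100 concurrent GPT-5-mini requests, while this stub only shows the per-user comparative call and the 70% confidence floor:

```python
import asyncio
from typing import Dict, List

CONFIDENCE_FLOOR = 0.70  # matches are only saved at 70%+ confidence

async def evaluate_all_candidates(user: Dict, candidates: List[Dict]) -> List[Dict]:
    """Stand-in for the GPT call: the real script sends ALL of one user's
    candidates in a single prompt so the model can compare them."""
    await asyncio.sleep(0)  # placeholder for the network round-trip
    # Illustrative verdict: echo baseline confidence as the final confidence
    return [
        {"twitter_id": c["twitter_id"], "final_confidence": c["baseline_confidence"]}
        for c in candidates
    ]

async def verify_user(user: Dict, candidates: List[Dict]) -> List[Dict]:
    verdicts = await evaluate_all_candidates(user, candidates)
    return [v for v in verdicts if v["final_confidence"] >= CONFIDENCE_FLOOR]

async def main() -> List[Dict]:
    user = {"telegram_user_id": 1}
    candidates = [
        {"twitter_id": "a", "baseline_confidence": 0.95},
        {"twitter_id": "b", "baseline_confidence": 0.65},
    ]
    return await verify_user(user, candidates)

print(asyncio.run(main()))  # [{'twitter_id': 'a', 'final_confidence': 0.95}]
```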
## Files
### Core Scripts
- `main.py` - Interactive menu for running the system
- `find_twitter_candidates.py` - Core matching logic (TwitterMatcher class)
- `find_twitter_candidates_threaded.py` - Threaded implementation (RECOMMENDED)
- `verify_twitter_matches_v2.py` - LLM verification with async (RECOMMENDED)
- `review_match_quality.py` - Analyze match quality and statistics
### Database Schema
- `setup_twitter_matching_schema.sql` - Database tables and indexes
## Database Tables
### `twitter_match_candidates`
Stores all potential matches found by the matching methods.
**Key fields:**
- `telegram_user_id` - Telegram contact user ID
- `twitter_id` - Twitter profile ID
- `match_method` - Which method found this candidate
- `baseline_confidence` - Initial confidence (0.0-1.0)
- `match_signals` - JSON with match details
- `llm_processed` - Whether LLM has evaluated this candidate
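A candidate row for this table can be assembled as follows. This is a hedged sketch: the column list is inferred from the key fields above (the real table may have more), and the helper name is illustrative:

```python
import json
from typing import Dict, Tuple

# Columns taken from the key fields listed above; the real table may have more.
INSERT_SQL = """
    INSERT INTO twitter_match_candidates
        (telegram_user_id, twitter_id, match_method,
         baseline_confidence, match_signals, llm_processed)
    VALUES %s
"""

def candidate_row(telegram_user_id: int, candidate: Dict) -> Tuple:
    """Flatten one candidate dict into an insertable tuple."""
    return (
        telegram_user_id,
        candidate["twitter_profile"]["id"],
        candidate["match_method"],
        candidate["baseline_confidence"],
        json.dumps(candidate["match_details"]),  # match_signals JSON
        False,  # not yet evaluated by the LLM
    )

row = candidate_row(42, {
    "twitter_profile": {"id": "63429948"},
    "match_method": "twitter_bio_url_resolves_to_telegram",
    "baseline_confidence": 0.95,
    "match_details": {"resolved_url": "https://t.me/bull_flash"},
})
print(row[2])  # twitter_bio_url_resolves_to_telegram
```

With psycopg2, `execute_values(cur, INSERT_SQL, rows)` would batch many such tuples in one round-trip, which is why the matcher imports it.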
### `twitter_telegram_matches`
Stores verified matches (70%+ confidence from LLM).
**Key fields:**
- `telegram_user_id` - Telegram contact
- `twitter_id` - Matched Twitter profile
- `final_confidence` - LLM-assigned confidence (0.70-1.0)
- `llm_verdict` - LLM reasoning
- `match_method` - Original matching method
- `matched_at` - Timestamp
### `url_resolution_queue`
Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).
**Key fields:**
- `twitter_id` - Twitter profile ID
- `original_url` - Shortened URL (e.g., t.co/abc)
- `resolved_url` - Full URL (e.g., https://t.me/username)
- `telegram_handles` - Extracted Telegram handles (JSONB array)
## Usage Examples
### Find Candidates for All Contacts (Threaded)
```bash
python3.10 find_twitter_candidates_threaded.py --workers 8
```
### Find Candidates for First 1000 Contacts
```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```
### Verify Matches with LLM (100 concurrent requests)
```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```
### Test Mode (50 users, 10 concurrent)
```bash
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10
```
### Review Match Quality
```bash
python3.10 review_match_quality.py
```
## Performance
### Candidate Finding (Threaded)
- **Speed**: ~1.5 contacts/sec
- **Time for 43K contacts**: ~16-18 hours
- **Workers**: 8 (default, configurable)
### LLM Verification (Async)
- **Speed**: ~32 users/minute with 100 concurrent requests
- **Cost**: ~$0.003 per user (GPT-5-mini)
- **Time for 43K users**: ~23 hours
## Recent Improvements
### V6 Prompt (Latest)
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer
### URL Resolution Integration
- Added Method 5b: Bio URL resolution
- Captures 140+ additional matches
- Especially valuable when usernames differ
- 0.95 baseline confidence (very high)
## Configuration
Environment variables (in `/Users/andrewjiang/Bao/TimeToLockIn/Profile/.env`):
```
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini
```
Database connections:
- `telegram_contacts` - Telegram contact data
- `twitter_data` - Twitter profile data
## Tips
1. **Always run threaded candidate finding** - 10-20x faster than single-threaded
2. **Use high concurrency for LLM verification** - 100+ concurrent requests for optimal speed
3. **Monitor costs** - Check OpenAI usage during verification
4. **Review match quality periodically** - Use `review_match_quality.py` to analyze results
5. **Test first** - Use `--test --limit 50` flags before full runs
## Troubleshooting
### LLM verification is slow
- Increase `--concurrent` parameter (try 100-200)
- Check OpenAI rate limits (1,000 RPM for Tier 1)
### Many low-quality matches
- Review and adjust V6 prompt in `verify_twitter_matches_v2.py`
- Check `review_match_quality.py` for insights
### Missing obvious matches
- Check if candidate was found: Query `twitter_match_candidates`
- If not found, may need new matching method
- If found but not verified, check LLM reasoning in `llm_verdict`
## Future Enhancements
- Add more matching methods (location, bio keywords, etc.)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats

find_twitter_candidates.py (executable, new file, 851 lines)

@@ -0,0 +1,851 @@
#!/usr/bin/env python3
"""
Twitter-Telegram Candidate Finder
Finds potential Twitter matches for Telegram contacts using:
1. Handle extraction from bios
2. Username variation generation
3. Fuzzy name matching
"""
import sys
import re
import json
from pathlib import Path
from typing import List, Dict, Set, Tuple
import psycopg2
from psycopg2.extras import DictCursor, execute_values
# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
from db_config import SessionLocal
from models import Contact
# Twitter database connection (adjust as needed)
TWITTER_DB_CONFIG = {
'dbname': 'twitter_data',
'user': 'andrewjiang', # Adjust to your setup
'host': 'localhost',
'port': 5432
}
class HandleExtractor:
"""Extract Twitter handles from text"""
@staticmethod
def extract_handles(text: str) -> List[str]:
"""Extract Twitter handles from bio/text"""
if not text:
return []
handles = set()
# Pattern 1: @username
pattern1 = r'@([a-zA-Z0-9_]{4,15})'
handles.update(re.findall(pattern1, text))
# Pattern 2: twitter.com/username or x.com/username
pattern2 = r'(?:twitter\.com|x\.com)/([a-zA-Z0-9_]{4,15})'
handles.update(re.findall(pattern2, text, re.IGNORECASE))
# Pattern 3: Clean standalone handles (risky, be conservative)
# Only if text is short and looks like a handle
if len(text) < 30 and text.count('@') == 1:
clean = text.strip('@').strip()
if re.match(r'^[a-zA-Z0-9_]{4,15}$', clean):
handles.add(clean)
return [h.lower() for h in handles if len(h) >= 4]
class UsernameVariationGenerator:
"""Generate Twitter handle variations from Telegram usernames"""
@staticmethod
def generate_variations(telegram_username: str) -> List[str]:
"""
Generate possible Twitter handle variations
Examples:
- alice_0x → [alice_0x, 0xalice, 0x_alice, alice0x]
- trader_69 → [trader_69, 69trader, 69_trader, trader69]
"""
if not telegram_username:
return []
variations = [telegram_username.lower()]
# Remove underscores
no_underscore = telegram_username.replace('_', '')
if no_underscore != telegram_username and len(no_underscore) >= 4:
variations.append(no_underscore.lower())
# Handle "0x" patterns (common in crypto)
if '0x' in telegram_username.lower():
# If 0x at end, try moving to front
if telegram_username.lower().endswith('_0x'):
base = telegram_username[:-3]
variations.extend([
f"0x{base}".lower(),
f"0x_{base}".lower()
])
elif telegram_username.lower().endswith('0x'):
base = telegram_username[:-2]
variations.extend([
f"0x{base}".lower(),
f"0x_{base}".lower()
])
# Try without underscore
no_under = telegram_username.replace('_', '')
if '0x' in no_under.lower() and no_under.lower() not in variations:
variations.append(no_under.lower())
# Handle trailing numbers (alice_69 → 69alice)
match = re.match(r'^([a-z_]+?)_?(\d+)$', telegram_username, re.IGNORECASE)
if match:
prefix, number = match.groups()
prefix = prefix.rstrip('_')
if len(f"{number}{prefix}") >= 4:
variations.extend([
f"{number}{prefix}".lower(),
f"{number}_{prefix}".lower()
])
# Handle leading numbers (69_alice → alice69)
match = re.match(r'^(\d+)_?([a-z_]+)$', telegram_username, re.IGNORECASE)
if match:
number, suffix = match.groups()
suffix = suffix.lstrip('_')
if len(f"{suffix}{number}") >= 4:
variations.extend([
f"{suffix}{number}".lower(),
f"{suffix}_{number}".lower()
])
# Single character removals (banteg → bantg, trader → trade)
# This catches shortened versions of usernames
base = telegram_username.lower()
if len(base) > 4: # Only if result would be at least 4 chars
for i in range(len(base)):
variation = base[:i] + base[i+1:]
if len(variation) >= 4:
variations.append(variation)
# Deduplicate and validate (4-15 chars for Twitter)
valid = []
for v in set(variations):
v_clean = v.strip('_')
if 4 <= len(v_clean) <= 15:
valid.append(v_clean)
return list(set(valid))
class TwitterMatcher:
"""Find Twitter profiles matching Telegram contacts"""
def __init__(self, twitter_conn, telegram_conn=None):
self.twitter_conn = twitter_conn
self.telegram_conn = telegram_conn # Needed for phash lookups
self.handle_extractor = HandleExtractor()
self.variation_generator = UsernameVariationGenerator()
def find_by_handle(self, handle: str) -> Dict:
"""Lookup Twitter profile by exact handle"""
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE LOWER(username) = %s
LIMIT 1
""", (handle.lower(),))
result = cur.fetchone()
return dict(result) if result else None
def find_by_fuzzy_name(self, telegram_name: str, limit=3) -> List[Dict]:
"""Find Twitter profiles with similar names using fuzzy matching"""
if not telegram_name or len(telegram_name) < 3:
return []
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
# Use parameterized query with proper escaping for % operator
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at,
similarity(name, %(name)s) AS name_score
FROM public.users
WHERE name %% %(name)s -- %% for similarity operator (escaped in string)
AND similarity(name, %(name)s) > 0.65 -- Increased threshold from 0.5 to reduce noise
ORDER BY name_score DESC
LIMIT %(limit)s
""", {'name': telegram_name, 'limit': limit})
return [dict(row) for row in cur.fetchall()]
def find_by_display_name_containment(self, telegram_name: str, limit=5) -> List[Dict]:
"""Find Twitter profiles where TG display name is contained in TW display name"""
if not telegram_name or len(telegram_name) < 3:
return []
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
# Direct containment search (case-insensitive)
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE name ILIKE %s -- TG name contained in TW name
AND LENGTH(name) >= %s -- TW name must be at least as long as TG name
ORDER BY followers_count DESC -- Prioritize by follower count
LIMIT %s
""", (f'%{telegram_name}%', len(telegram_name), limit))
return [dict(row) for row in cur.fetchall()]
def find_by_phash_match(self, telegram_user_id: int) -> List[Dict]:
"""Find Twitter profiles with matching profile picture phash (distance 0-1 only)"""
if not self.telegram_conn:
return []
with self.telegram_conn.cursor(cursor_factory=DictCursor) as cur:
# Query pre-computed phash matches (distance 0-1 for high confidence)
cur.execute("""
SELECT
m.twitter_user_id,
m.twitter_username,
m.hamming_distance,
m.telegram_phash,
m.twitter_phash
FROM telegram_twitter_phash_matches m
WHERE m.telegram_user_id = %s
AND m.hamming_distance <= 1 -- Only exact and distance-1 matches
ORDER BY m.hamming_distance ASC
LIMIT 5
""", (telegram_user_id,))
phash_matches = cur.fetchall()
# Commit to close transaction
self.telegram_conn.commit()
if not phash_matches:
return []
# Fetch full Twitter profile data for matched users
twitter_ids = [m['twitter_user_id'] for m in phash_matches]
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE id = ANY(%s)
""", (twitter_ids,))
twitter_profiles = [dict(row) for row in cur.fetchall()]
# Enrich with phash match details
twitter_profile_map = {str(p['id']): p for p in twitter_profiles}
results = []
for match in phash_matches:
tw_id = match['twitter_user_id']
if tw_id in twitter_profile_map:
profile = twitter_profile_map[tw_id].copy()
profile['phash_distance'] = match['hamming_distance']
profile['telegram_phash'] = match['telegram_phash']
profile['twitter_phash'] = match['twitter_phash']
results.append(profile)
return results
def find_by_telegram_in_twitter_bio(self, telegram_username: str, limit=3) -> List[Dict]:
"""Find Twitter profiles that mention this Telegram username in their bio (exact @mention only)"""
if not telegram_username or len(telegram_username) < 4:
return []
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
# Search for various Telegram handle patterns in Twitter bios
# Use regex with word boundaries to avoid substring matches in other handles
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE description IS NOT NULL
AND (
description ~* %s -- @username with word boundary (not part of longer handle)
OR LOWER(description) LIKE %s -- t.me/username
OR LOWER(description) LIKE %s -- telegram.me/username
)
ORDER BY followers_count DESC
LIMIT %s
""", (
r'@' + telegram_username + r'(\s|$|[^a-zA-Z0-9_])', # Word boundary: space, end, or non-alphanumeric
f'%t.me/{telegram_username.lower()}%',
f'%telegram.me/{telegram_username.lower()}%',
limit
))
return [dict(row) for row in cur.fetchall()]
def find_by_resolved_url(self, telegram_username: str) -> List[Dict]:
"""Find Twitter profiles whose bio URL resolves to this Telegram username"""
if not telegram_username or len(telegram_username) < 4:
return []
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
# Query url_resolution_queue for Twitter profiles with resolved Telegram links
cur.execute("""
SELECT DISTINCT
u.id,
u.username,
u.name,
u.description,
u.location,
u.verified,
u.is_blue_verified,
u.followers_count,
u.following_count,
u.created_at,
urq.resolved_url,
urq.original_url
FROM url_resolution_queue urq
JOIN public.users u ON u.id::text = urq.twitter_id
WHERE urq.telegram_handles IS NOT NULL
AND urq.telegram_handles @> %s::jsonb
ORDER BY u.followers_count DESC
LIMIT 3
""", (f'["{telegram_username.lower()}"]',))
results = cur.fetchall()
# Enrich with URL details
enriched = []
for row in results:
profile = dict(row)
profile['resolved_telegram_url'] = profile.pop('resolved_url')
profile['original_url'] = profile.pop('original_url')
enriched.append(profile)
return enriched
def find_by_username_in_display_name(self, search_term: str, is_telegram: bool, limit: int = 5) -> List[Dict]:
"""
Find Twitter profiles where display name contains username pattern
If is_telegram=True: Search for TG username in Twitter display names
If is_telegram=False: reverse lookup - find Twitter profiles whose username appears within the given TG display name
"""
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
if is_telegram:
# TG username in Twitter display name
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE name IS NOT NULL
AND LOWER(name) LIKE %s
ORDER BY followers_count DESC
LIMIT %s
""", (f'%{search_term.lower()}%', limit))
else:
# Twitter username in TG display name - search for Twitter profiles whose username appears in the search term
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE username IS NOT NULL
AND LOWER(%s) LIKE '%%' || LOWER(username) || '%%'
ORDER BY followers_count DESC
LIMIT %s
""", (search_term, limit))
return [dict(row) for row in cur.fetchall()]
def get_display_name(self, contact: Contact) -> str:
"""Get display name with fallback to first_name + last_name"""
if contact.display_name:
return contact.display_name
# Fallback to first_name + last_name
parts = [contact.first_name, contact.last_name]
return ' '.join(p for p in parts if p).strip()
def find_candidates_for_contact(self, contact: Contact) -> List[Dict]:
"""
Find all Twitter candidates for a single Telegram contact
Returns list of candidates with:
{
'twitter_profile': {...},
'match_method': 'exact_bio_handle' | 'exact_username' | 'username_variation' | 'fuzzy_name',
'baseline_confidence': 0.0-1.0,
'match_details': {...}
}
"""
candidates = []
seen_twitter_ids = set()
# Get display name with fallback
display_name = self.get_display_name(contact)
# Method 1: Phash matching (profile picture similarity)
phash_matches = self.find_by_phash_match(contact.user_id)
for twitter_profile in phash_matches:
phash_distance = twitter_profile.pop('phash_distance')
telegram_phash = twitter_profile.pop('telegram_phash')
twitter_phash = twitter_profile.pop('twitter_phash')
# Baseline confidence based on phash distance
if phash_distance == 0:
baseline_confidence = 0.95 # Exact match - VERY strong signal
elif phash_distance == 1:
baseline_confidence = 0.88 # 1-bit difference - strong signal
else:
continue # Skip distance > 1
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'phash_match',
'baseline_confidence': baseline_confidence,
'match_details': {
'phash_distance': phash_distance,
'telegram_phash': telegram_phash,
'twitter_phash': twitter_phash
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 2: Extract handles from bio
if contact.bio:
bio_handles = self.handle_extractor.extract_handles(contact.bio)
for handle in bio_handles:
twitter_profile = self.find_by_handle(handle)
if twitter_profile and twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'exact_bio_handle',
'baseline_confidence': 0.95,
'match_details': {
'extracted_handle': handle,
'from_bio': True
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 3: Exact username match
if contact.username:
twitter_profile = self.find_by_handle(contact.username)
if twitter_profile and twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'exact_username',
'baseline_confidence': 0.90,
'match_details': {
'telegram_username': contact.username,
'twitter_username': twitter_profile['username']
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 4: Username variations
if contact.username:
variations = self.variation_generator.generate_variations(contact.username)
for variation in variations:
if variation == contact.username.lower():
continue  # Exact username already checked in Method 3
twitter_profile = self.find_by_handle(variation)
if twitter_profile and twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'username_variation',
'baseline_confidence': 0.80,
'match_details': {
'telegram_username': contact.username,
'username_variation': variation,
'twitter_username': twitter_profile['username']
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 5: Twitter bio contains Telegram username (reverse lookup)
if contact.username:
reverse_matches = self.find_by_telegram_in_twitter_bio(contact.username, limit=3)
for twitter_profile in reverse_matches:
if twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'twitter_bio_has_telegram',
'baseline_confidence': 0.92, # Very high confidence - they explicitly mention their Telegram
'match_details': {
'telegram_username': contact.username,
'twitter_username': twitter_profile['username'],
'found_in_twitter_bio': True
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 5b: Twitter bio URL resolves to Telegram username (via url_resolution_queue)
if contact.username:
url_resolved_matches = self.find_by_resolved_url(contact.username)
for twitter_profile in url_resolved_matches:
if twitter_profile['id'] not in seen_twitter_ids:
resolved_url = twitter_profile.pop('resolved_telegram_url', None)
original_url = twitter_profile.pop('original_url', None)
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'twitter_bio_url_resolves_to_telegram',
'baseline_confidence': 0.95, # VERY high confidence - explicit URL link in bio
'match_details': {
'telegram_username': contact.username,
'twitter_username': twitter_profile['username'],
'resolved_url': resolved_url,
'original_url': original_url,
'found_via_url_resolution': True
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 6: Display name containment (TG name in TW name)
if display_name:
containment_matches = self.find_by_display_name_containment(display_name, limit=5)
for twitter_profile in containment_matches:
if twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'display_name_containment',
'baseline_confidence': 0.92, # High confidence for exact name containment
'match_details': {
'telegram_name': display_name,
'twitter_name': twitter_profile['name'],
'match_type': 'tg_name_contained_in_tw_name'
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 7: Fuzzy name match (always run to find additional candidates)
if display_name:
fuzzy_matches = self.find_by_fuzzy_name(display_name, limit=5)
for i, twitter_profile in enumerate(fuzzy_matches):
name_score = twitter_profile.pop('name_score')
# Calculate baseline confidence from name similarity
baseline_confidence = min(0.85, name_score) # Cap at 0.85 for fuzzy matches
if twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'fuzzy_name',
'baseline_confidence': baseline_confidence,
'match_details': {
'telegram_name': display_name,
'twitter_name': twitter_profile['name'],
'fuzzy_score': name_score,
'candidate_rank': i + 1
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 8: TG username in Twitter display name
if contact.username:
matches = self.find_by_username_in_display_name(contact.username, is_telegram=True, limit=3)
for twitter_profile in matches:
if twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'tg_username_in_twitter_name',
'baseline_confidence': 0.88,
'match_details': {
'telegram_username': contact.username,
'twitter_name': twitter_profile['name'],
'found_in_display_name': True
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 9: Twitter username in TG display name
if display_name:
matches = self.find_by_username_in_display_name(display_name, is_telegram=False, limit=3)
for twitter_profile in matches:
if twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'twitter_username_in_tg_name',
'baseline_confidence': 0.86,
'match_details': {
'telegram_name': display_name,
'twitter_username': twitter_profile['username'],
'username_in_display_name': True
}
})
seen_twitter_ids.add(twitter_profile['id'])
return candidates
def save_candidates_to_db(telegram_user_id: int, account_id: int, candidates: List[Dict], telegram_db):
"""Save candidates to the twitter_match_candidates table.
Note: despite the name, telegram_db must be an open psycopg2 cursor —
execute_values below requires a cursor, and the caller passes one.
"""
if not candidates:
return
insert_data = []
for cand in candidates:
tw = cand['twitter_profile']
insert_data.append((
account_id, # Add account_id
telegram_user_id,
tw['id'],
tw['username'],
tw['name'],
tw.get('description', ''),
tw.get('location'),
tw.get('verified', False),
tw.get('is_blue_verified', False),
tw.get('followers_count', 0),
cand.get('match_details', {}).get('candidate_rank', 1),
cand['match_method'],
cand['baseline_confidence'],
json.dumps(cand['match_details']) # Convert dict to JSON string
))
execute_values(telegram_db, """
INSERT INTO twitter_match_candidates (
account_id,
telegram_user_id,
twitter_id,
twitter_username,
twitter_name,
twitter_bio,
twitter_location,
twitter_verified,
twitter_blue_verified,
twitter_followers_count,
candidate_rank,
match_method,
baseline_confidence,
match_signals
) VALUES %s
ON CONFLICT DO NOTHING
""", insert_data, template="""(
%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
)""")
def main():
print()
print("=" * 70)
print("🔍 Twitter-Telegram Candidate Finder")
print("=" * 70)
print()
# Check arguments
test_mode = '--test' in sys.argv
limit = 100 if test_mode else None
# Check for --limit parameter
if '--limit' in sys.argv:
idx = sys.argv.index('--limit')
limit = int(sys.argv[idx + 1])
print(f"📊 LIMIT MODE: Processing first {limit:,} contacts")
print()
elif test_mode:
print("🧪 TEST MODE: Processing first 100 contacts only")
print()
# Connect to databases
print("📡 Connecting to databases...")
# Connect SQLAlchemy to localhost (not RDS)
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
localhost_engine = create_engine('postgresql://andrewjiang@localhost:5432/telegram_contacts')
LocalSession = sessionmaker(bind=localhost_engine)
telegram_db = LocalSession()
try:
twitter_conn = psycopg2.connect(**TWITTER_DB_CONFIG)
twitter_conn.autocommit = False
except Exception as e:
print(f"❌ Failed to connect to Twitter database: {e}")
print(f" Config: {TWITTER_DB_CONFIG}")
return False
# Also need psycopg2 connection to telegram DB for writing
try:
telegram_conn = psycopg2.connect(dbname='telegram_contacts', user='andrewjiang', host='localhost', port=5432)
telegram_conn.autocommit = False
except Exception as e:
print(f"❌ Failed to connect to Telegram database: {e}")
return False
print("✅ Connected to both databases")
print()
try:
# Initialize matcher (pass both connections for phash lookup)
telegram_psycopg_conn = psycopg2.connect(
dbname='telegram_contacts',
user='andrewjiang',
host='localhost'
)
matcher = TwitterMatcher(twitter_conn, telegram_psycopg_conn)
# Get Telegram contacts with bios or usernames
print("🔍 Loading Telegram contacts...")
query = telegram_db.query(Contact).filter(
Contact.user_id > 0, # Exclude channels
Contact.is_deleted == False,
Contact.is_bot == False
).filter(
(Contact.bio != None) | (Contact.username != None)
).order_by(Contact.user_id)
if limit:
query = query.limit(limit)
contacts = query.all()
print(f"✅ Found {len(contacts):,} contacts to process")
print()
# Process each contact
stats = {
'processed': 0,
'with_candidates': 0,
'total_candidates': 0,
'by_method': {}
}
print("🚀 Finding Twitter candidates...")
print()
for i, contact in enumerate(contacts, 1):
candidates = matcher.find_candidates_for_contact(contact)
if candidates:
stats['with_candidates'] += 1
stats['total_candidates'] += len(candidates)
# Track by method
for cand in candidates:
method = cand['match_method']
stats['by_method'][method] = stats['by_method'].get(method, 0) + 1
# Save to database
with telegram_conn.cursor() as cur:
save_candidates_to_db(contact.user_id, contact.account_id, candidates, cur)
telegram_conn.commit()
stats['processed'] += 1
# Progress update every 10 contacts
if i % 10 == 0:
print(f" Processed {i:,}/{len(contacts):,} contacts... (candidates: {stats['with_candidates']:,}, total: {stats['total_candidates']:,})")
elif i == 1:
# Show first contact immediately
print(f" Processed 1/{len(contacts):,} contacts... (candidates: {stats['with_candidates']:,}, total: {stats['total_candidates']:,})")
# Final stats
print()
print("=" * 70)
print("✅ CANDIDATE FINDING COMPLETE")
print("=" * 70)
print()
print(f"📊 Statistics:")
print(f" Processed: {stats['processed']:,} contacts")
print(f" With candidates: {stats['with_candidates']:,} ({stats['with_candidates']/max(stats['processed'], 1)*100:.1f}%)")
print(f" Total candidates: {stats['total_candidates']:,}")
print(f" Avg candidates per match: {stats['total_candidates']/max(stats['with_candidates'], 1):.1f}")
print()
print(f"📈 By method:")
for method, count in sorted(stats['by_method'].items(), key=lambda x: -x[1]):
print(f" {method}: {count:,}")
print()
return True
except Exception as e:
print(f"❌ Error: {e}")
import traceback
traceback.print_exc()
return False
finally:
telegram_db.close()
twitter_conn.close()
telegram_conn.close()
if __name__ == "__main__":
try:
success = main()
sys.exit(0 if success else 1)
except KeyboardInterrupt:
print("\n\n⚠️ Interrupted by user")
sys.exit(1)
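The phash tiers used by Method 1 above map Hamming distance between the two profile pictures' perceptual hashes onto a baseline confidence (0 bits → 0.95, 1 bit → 0.88, anything further apart is skipped). A minimal, self-contained sketch of that mapping — the helper names here are illustrative, not part of the module:

```python
def phash_distance(a_hex: str, b_hex: str) -> int:
    """Count differing bits between two equal-length hex phash digests."""
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")

def phash_confidence(distance: int):
    if distance == 0:
        return 0.95  # identical profile pictures - very strong signal
    if distance == 1:
        return 0.88  # near-identical (1-bit difference) - strong signal
    return None      # too dissimilar; the candidate is skipped

print(phash_distance("d1c4f0e2a1b3c4d5", "d1c4f0e2a1b3c4d4"))  # 1
```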

389
find_twitter_candidates_threaded.py Executable file
View File

@@ -0,0 +1,389 @@
#!/usr/bin/env python3
"""
Threading-based Twitter-Telegram Candidate Finder
Uses threading instead of multiprocessing for better macOS compatibility
and efficient I/O-bound parallel processing.
"""
import sys
from pathlib import Path
import psycopg2
from psycopg2.extras import execute_values, DictCursor
from concurrent.futures import ThreadPoolExecutor, as_completed
import argparse
from typing import List, Dict, Tuple
import time
import threading
# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from models import Contact
# Import the matcher from the original script
from find_twitter_candidates import TwitterMatcher, TWITTER_DB_CONFIG
# Database configuration
TELEGRAM_DB_URL = 'postgresql://andrewjiang@localhost:5432/telegram_contacts'
# Thread-local storage for database connections
thread_local = threading.local()
def get_thread_connections():
"""Get or create database connections for this thread"""
if not hasattr(thread_local, 'twitter_conn'):
thread_local.twitter_conn = psycopg2.connect(**TWITTER_DB_CONFIG)
thread_local.telegram_conn = psycopg2.connect(
dbname='telegram_contacts',
user='andrewjiang',
host='localhost',
port=5432
)
thread_local.matcher = TwitterMatcher(
thread_local.twitter_conn,
thread_local.telegram_conn
)
return thread_local.twitter_conn, thread_local.telegram_conn, thread_local.matcher
def process_contact(contact: Contact) -> Tuple[bool, int, List[Dict], Dict]:
"""
Process a single contact in a worker thread.
Returns: (success, num_candidates, candidates, method_stats)
"""
import io  # sys is already imported at module level
# Capture stderr to detect silent failures
old_stderr = sys.stderr
sys.stderr = stderr_capture = io.StringIO()
try:
# Get thread-local database connections
twitter_conn, telegram_conn, matcher = get_thread_connections()
# Find candidates
candidates = matcher.find_candidates_for_contact(contact)
# Check if any warnings were logged
stderr_content = stderr_capture.getvalue()
if stderr_content:
print(f" 🔍 Warnings for contact {contact.user_id}:")
print(f" {stderr_content}")
method_stats = {}
if candidates:
# Track method stats
for cand in candidates:
method = cand['match_method']
method_stats[method] = method_stats.get(method, 0) + 1
# Add contact info to each candidate for later insertion
for cand in candidates:
cand['telegram_user_id'] = contact.user_id
cand['account_id'] = contact.account_id
return True, len(candidates), candidates, method_stats
except Exception as e:
print(f" ⚠️ Error processing contact {contact.user_id}: {e}")
print(f" Contact details: username={contact.username}, display_name={getattr(contact, 'display_name', None)}")
import traceback
# sys.stderr is still redirected to the capture buffer here, so send the traceback to stdout
traceback.print_exc(file=sys.stdout)
return False, 0, [], {}
finally:
sys.stderr = old_stderr
def save_candidates_batch(candidates: List[Dict], telegram_conn):
"""Save a batch of candidates to database"""
if not candidates:
return 0
insert_data = []
for cand in candidates:
tw = cand['twitter_profile']
match_signals = cand.get('match_details', {})
insert_data.append((
cand['account_id'],
cand['telegram_user_id'],
tw['id'],
tw['username'],
tw.get('name'),
tw.get('description'),
tw.get('location'),
tw.get('verified', False),
tw.get('is_blue_verified', False),
tw.get('followers_count', 0),
0, # candidate_rank (will be set later if needed)
cand['match_method'],
cand['baseline_confidence'],
psycopg2.extras.Json(match_signals),
True, # needs_llm_review
False, # llm_processed
))
with telegram_conn.cursor() as cur:
execute_values(cur, """
INSERT INTO twitter_match_candidates (
account_id,
telegram_user_id,
twitter_id,
twitter_username,
twitter_name,
twitter_bio,
twitter_location,
twitter_verified,
twitter_blue_verified,
twitter_followers_count,
candidate_rank,
match_method,
baseline_confidence,
match_signals,
needs_llm_review,
llm_processed
) VALUES %s
ON CONFLICT (telegram_user_id, twitter_id) DO NOTHING
""", insert_data, page_size=1000)
telegram_conn.commit()
return len(insert_data)
def main():
parser = argparse.ArgumentParser(description='Find Twitter candidates for Telegram contacts (threading-based)')
parser.add_argument('--limit', type=int, help='Limit number of Telegram contacts to process')
parser.add_argument('--test', action='store_true', help='Test mode: process first 100 contacts only')
parser.add_argument('--workers', type=int, default=8,
help='Number of worker threads (default: 8)')
parser.add_argument('--user-id-min', type=int, help='Minimum user_id to process (for parallel ranges)')
parser.add_argument('--user-id-max', type=int, help='Maximum user_id to process (for parallel ranges)')
parser.add_argument('--range-name', type=str, help='Name for this range (for logging)')
args = parser.parse_args()
num_workers = args.workers
limit = args.limit
user_id_min = args.user_id_min
user_id_max = args.user_id_max
range_name = args.range_name or "full"
print("=" * 70)
print(f"🔍 Twitter-Telegram Candidate Finder (THREADED) - Range: {range_name}")
print("=" * 70)
print()
if args.test:
limit = 100
print("🧪 TEST MODE: Processing first 100 contacts only")
print()
elif limit:
print(f"📊 LIMIT MODE: Processing first {limit:,} contacts")
print()
if user_id_min is not None and user_id_max is not None:
print(f"📍 User ID Range: {user_id_min:,} to {user_id_max:,}")
print()
print(f"🧵 Worker threads: {num_workers}")
print()
# Load contacts using raw psycopg2
print("📡 Loading Telegram contacts...")
conn = psycopg2.connect(
dbname='telegram_contacts',
user='andrewjiang',
host='localhost',
port=5432
)
with conn.cursor() as cur:
# First, get list of already processed contacts
cur.execute("""
SELECT DISTINCT telegram_user_id
FROM twitter_match_candidates
""")
already_processed = set(row[0] for row in cur.fetchall())
print(f"📋 Already processed: {len(already_processed):,} contacts (will skip)")
print()
query = """
SELECT account_id, user_id, display_name, first_name, last_name, username, phone, bio, is_bot, is_deleted
FROM contacts
WHERE user_id > 0
AND is_deleted = false
AND is_bot = false
AND (bio IS NOT NULL OR username IS NOT NULL)
"""
# Add user_id range filter if specified
if user_id_min is not None:
query += f" AND user_id >= {user_id_min}"
if user_id_max is not None:
query += f" AND user_id <= {user_id_max}"
query += " ORDER BY user_id"
if limit:
query += f" LIMIT {limit}"
cur.execute(query)
rows = cur.fetchall()
conn.close()
# Convert to Contact objects, skipping already processed
contacts = []
skipped = 0
for row in rows:
user_id = row[1]
# Skip if already processed
if user_id in already_processed:
skipped += 1
continue
contact = Contact(
account_id=row[0],
user_id=user_id,
display_name=row[2],
first_name=row[3],
last_name=row[4],
username=row[5],
phone=row[6],
bio=row[7],
is_bot=row[8],
is_deleted=row[9]
)
contacts.append(contact)
total_contacts = len(contacts)
print(f"✅ Found {total_contacts:,} NEW contacts to process (skipped {skipped:,} already done)")
print()
print("🚀 Processing contacts with thread pool...")
print()
start_time = time.time()
# Stats tracking
total_processed = 0
total_with_candidates = 0
all_candidates = []
combined_method_stats = {}
total_saved = 0
# Database connection for incremental saves
telegram_conn = psycopg2.connect(
dbname='telegram_contacts',
user='andrewjiang',
host='localhost',
port=5432
)
# Process with ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=num_workers) as executor:
# Submit all tasks
future_to_contact = {
executor.submit(process_contact, contact): contact
for contact in contacts
}
# Process results as they complete
for i, future in enumerate(as_completed(future_to_contact), 1):
try:
success, num_candidates, candidates, method_stats = future.result()
if success:
total_processed += 1
if candidates:
total_with_candidates += 1
all_candidates.extend(candidates)
# Update method stats
for method, count in method_stats.items():
combined_method_stats[method] = combined_method_stats.get(method, 0) + count
# Flush to database whenever 100+ candidates have accumulated
if len(all_candidates) >= 100:
saved = save_candidates_batch(all_candidates, telegram_conn)
total_saved += saved
all_candidates = [] # Clear buffer
# Progress update every 10 contacts
if i % 10 == 0 or i == 1:
elapsed = time.time() - start_time
rate = total_processed / elapsed if elapsed > 0 else 0
remaining = total_contacts - total_processed
eta_seconds = remaining / rate if rate > 0 else 0
eta_hours = eta_seconds / 3600
print(f" Progress: {total_processed}/{total_contacts} ({total_processed/total_contacts*100:.1f}%) | "
f"{total_saved + len(all_candidates):,} candidates (💾 {total_saved:,} saved) | "
f"Rate: {rate:.1f}/sec | "
f"ETA: {eta_hours:.1f}h")
except Exception as e:
print(f" ❌ Error processing future: {e}")
import traceback
traceback.print_exc()
# Save any remaining candidates
if all_candidates:
saved = save_candidates_batch(all_candidates, telegram_conn)
total_saved += saved
telegram_conn.close()
elapsed = time.time() - start_time
print()
print(f"⏱️ Processing completed in {elapsed:.1f} seconds ({elapsed/60:.1f} minutes)")
print(f" Rate: {total_processed/elapsed:.1f} contacts/second")
print()
print(f"💾 Total saved: {total_saved:,} candidates")
print()
# Print stats
print("=" * 70)
print("✅ CANDIDATE FINDING COMPLETE")
print("=" * 70)
print()
print(f"📊 Statistics:")
print(f" Processed: {total_processed:,} contacts")
print(f" With candidates: {total_with_candidates:,} ({total_with_candidates/max(total_processed, 1)*100:.1f}%)")
print(f" Total candidates: {total_saved:,}")
if total_with_candidates > 0:
print(f" Avg candidates per match: {total_saved/total_with_candidates:.1f}")
print()
print(f"📈 By method:")
for method, count in sorted(combined_method_stats.items(), key=lambda x: x[1], reverse=True):
print(f" {method}: {count:,}")
print()
return True
if __name__ == "__main__":
try:
success = main()
sys.exit(0 if success else 1)
except KeyboardInterrupt:
print("\n\n❌ Interrupted by user")
sys.exit(1)
except Exception as e:
print(f"\n❌ Error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
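Because psycopg2 connections must not be shared across threads, the script above gives each worker its own lazily-created connection via `threading.local()`. A minimal sketch of that pattern, with sqlite3 standing in for psycopg2 so the example is self-contained (names are illustrative):

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

thread_local = threading.local()

def get_conn():
    # Each worker thread lazily opens (and then reuses) its own connection.
    if not hasattr(thread_local, "conn"):
        thread_local.conn = sqlite3.connect(":memory:")
    return thread_local.conn

def worker(n):
    cur = get_conn().execute("SELECT ?", (n,))
    return cur.fetchone()[0]

with ThreadPoolExecutor(max_workers=4) as ex:
    results = sorted(ex.map(worker, range(8)))
print(results)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The same idea applies directly to `get_thread_connections()` above: connection objects never cross thread boundaries, only the results do.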

225
main.py Executable file
View File

@@ -0,0 +1,225 @@
#!/usr/bin/env python3
"""
Twitter-Telegram Profile Matching System
Main menu for finding candidates and verifying matches with LLM
"""
import sys
import os
import subprocess
import psycopg2
# Add parent directory to path for imports
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'src'))
# Database configuration
DB_CONFIG = {
'dbname': 'telegram_contacts',
'user': 'andrewjiang',
'host': 'localhost',
'port': 5432
}
TWITTER_DB_CONFIG = {
'dbname': 'twitter_data',
'user': 'andrewjiang',
'host': 'localhost',
'port': 5432
}
def get_stats():
"""Get current matching statistics"""
conn = psycopg2.connect(**DB_CONFIG)
cur = conn.cursor()
stats = {}
# Candidates stats
cur.execute("""
SELECT
COUNT(DISTINCT telegram_user_id) as total_users,
COUNT(*) as total_candidates,
COUNT(*) FILTER (WHERE llm_processed = TRUE) as processed_candidates,
COUNT(*) FILTER (WHERE llm_processed = FALSE) as pending_candidates
FROM twitter_match_candidates
""")
row = cur.fetchone()
stats['total_users'] = row[0]
stats['total_candidates'] = row[1]
stats['processed_candidates'] = row[2]
stats['pending_candidates'] = row[3]
# Matches stats
cur.execute("""
SELECT
COUNT(*) as total_matches,
AVG(final_confidence) as avg_confidence,
COUNT(*) FILTER (WHERE final_confidence >= 0.90) as high_conf,
COUNT(*) FILTER (WHERE final_confidence >= 0.80 AND final_confidence < 0.90) as med_conf,
COUNT(*) FILTER (WHERE final_confidence >= 0.70 AND final_confidence < 0.80) as low_conf
FROM twitter_telegram_matches
""")
row = cur.fetchone()
stats['total_matches'] = row[0]
stats['avg_confidence'] = row[1] or 0
stats['high_conf'] = row[2]
stats['med_conf'] = row[3]
stats['low_conf'] = row[4]
# Users with matches
cur.execute("""
SELECT COUNT(DISTINCT telegram_user_id)
FROM twitter_telegram_matches
""")
stats['users_with_matches'] = cur.fetchone()[0]
cur.close()
conn.close()
return stats
def print_header():
"""Print main header"""
print()
print("=" * 80)
print("🔗 Twitter-Telegram Profile Matching System")
print("=" * 80)
print()
def print_stats():
"""Print current statistics"""
stats = get_stats()
print("📊 Current Statistics:")
print("-" * 80)
print(f"Candidates:")
print(f" • Users with candidates: {stats['total_users']:,}")
print(f" • Total candidates found: {stats['total_candidates']:,}")
print(f" • Processed by LLM: {stats['processed_candidates']:,}")
print(f" • Pending verification: {stats['pending_candidates']:,}")
print()
print(f"Verified Matches:")
print(f" • Users with matches: {stats['users_with_matches']:,}")
print(f" • Total matches: {stats['total_matches']:,}")
print(f" • Average confidence: {stats['avg_confidence']:.2f}")
print(f" • High confidence (90%+): {stats['high_conf']:,}")
print(f" • Medium confidence (80-89%): {stats['med_conf']:,}")
print(f" • Low confidence (70-79%): {stats['low_conf']:,}")
print("-" * 80)
print()
def run_script(script_name, *args):
"""Run a Python script with arguments"""
script_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), f"{script_name}.py")
cmd = [sys.executable, script_path] + list(args)  # use the current interpreter rather than a hardcoded python3.10
subprocess.run(cmd)
def main():
while True:
print_header()
print_stats()
print("📋 Main Menu:")
print()
print("STEP 1: Find Candidates")
print(" 1. Find Twitter candidates (threaded, RECOMMENDED)")
print(" 2. Find Twitter candidates (single-threaded)")
print()
print("STEP 2: Verify with LLM")
print(" 3. Verify matches with LLM (async, RECOMMENDED)")
print(" 4. Verify matches with LLM (test mode - 50 users)")
print()
print("Analysis & Review")
print(" 5. Review match quality")
print(" 6. Show statistics only")
print()
print(" 0. Exit")
print()
choice = input("👉 Enter your choice: ").strip()
if choice == '0':
print("\n👋 Goodbye!\n")
break
elif choice == '1':
# Find candidates (threaded)
print()
print("🔍 Finding Twitter candidates (threaded mode)...")
print()
limit_input = input("👉 How many contacts? (press Enter for all): ").strip()
workers = input("👉 Number of worker threads (default: 8): ").strip() or '8'
if limit_input:
run_script('find_twitter_candidates_threaded', '--limit', limit_input, '--workers', workers)
else:
run_script('find_twitter_candidates_threaded', '--workers', workers)
input("\n✅ Press Enter to continue...")
elif choice == '2':
# Find candidates (single-threaded)
print()
print("🔍 Finding Twitter candidates (single-threaded mode)...")
print()
limit_input = input("👉 How many contacts? (press Enter for all): ").strip()
if limit_input:
run_script('find_twitter_candidates', '--limit', limit_input)
else:
run_script('find_twitter_candidates')
input("\n✅ Press Enter to continue...")
elif choice == '3':
# Verify with LLM (async)
print()
print("🤖 Verifying matches with LLM (async mode)...")
print()
concurrent = input("👉 Concurrent requests (default: 100): ").strip() or '100'
run_script('verify_twitter_matches_v2', '--verbose', '--concurrent', concurrent)
input("\n✅ Press Enter to continue...")
elif choice == '4':
# Verify with LLM (test mode)
print()
print("🧪 Test mode: Verifying 50 users with LLM...")
print()
run_script('verify_twitter_matches_v2', '--test', '--limit', '50', '--verbose', '--concurrent', '10')
input("\n✅ Press Enter to continue...")
elif choice == '5':
# Review match quality
print()
print("📊 Reviewing match quality...")
print()
run_script('review_match_quality')
input("\n✅ Press Enter to continue...")
elif choice == '6':
# Just show stats, loop back to menu
continue
else:
print("\n❌ Invalid choice. Please try again.\n")
input("Press Enter to continue...")
if __name__ == "__main__":
try:
main()
except KeyboardInterrupt:
print("\n\n👋 Interrupted. Goodbye!\n")
sys.exit(0)
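The `get_stats()` query above relies on PostgreSQL's `COUNT(*) FILTER (WHERE ...)` aggregate clause to compute several conditional counts in a single pass. A portable sketch of the same idea using the `SUM(CASE ...)` form, which runs on sqlite3 (table contents here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE twitter_match_candidates (llm_processed INTEGER)")
conn.executemany("INSERT INTO twitter_match_candidates VALUES (?)",
                 [(1,), (1,), (0,)])

# total, processed, and pending candidates computed in one scan
row = conn.execute("""
    SELECT COUNT(*),
           SUM(CASE WHEN llm_processed = 1 THEN 1 ELSE 0 END),
           SUM(CASE WHEN llm_processed = 0 THEN 1 ELSE 0 END)
    FROM twitter_match_candidates
""").fetchone()
print(row)  # (3, 2, 1)
```

On PostgreSQL the `FILTER` form used in `get_stats()` is equivalent and usually clearer.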

406
review_match_quality.py Executable file
View File

@@ -0,0 +1,406 @@
#!/usr/bin/env python3
"""
Critical Match Quality Reviewer
Analyzes verification results with deep understanding of:
- Twitter/Telegram/crypto culture
- Common false positive patterns
- Company vs personal account indicators
- Context alignment
"""
import re
import json
from pathlib import Path
from typing import List, Dict, Tuple
LOG_FILE = Path(__file__).parent.parent / 'verification_v2_log.txt'
class MatchReviewer:
"""Critical reviewer with crypto/web3 domain knowledge"""
# Company/product/team indicators
COMPANY_INDICATORS = [
r'\bis\s+(a|an)\s+\w+\s+(way|tool|platform|app|service|protocol)',
r'official\s+(account|page|channel)',
r'(team|official)\s*$',
r'^(the|a)\s+\w+\s+(for|to)',
r'brought to you by',
r'hosted by',
r'founded by',
r'(community|project)\s+(account|page)',
r'(dao|protocol|network)\s*$',
r'building\s+(the|a)\s+',
]
# Personal account indicators
PERSONAL_INDICATORS = [
r'(ceo|founder|co-founder|developer|builder|engineer|researcher)\s+at',
r'working\s+(at|on|with)',
r'^(i|i\'m|my)',
r'(my|personal)\s+(views|opinions|thoughts)',
r'\b(he/him|she/her|they/them)\b',
]
# Crypto/Web3 keywords
CRYPTO_KEYWORDS = [
'web3', 'crypto', 'blockchain', 'defi', 'nft', 'dao', 'dapp',
'ethereum', 'solana', 'bitcoin', 'polygon', 'base', 'arbitrum',
'smart contract', 'token', 'wallet', 'metamask', 'coinbase',
'yield', 'farming', 'staking', 'airdrop', 'whitelist', 'mint',
'protocol', 'l1', 'l2', 'rollup', 'zk', 'evm'
]
def __init__(self):
self.issues = []
self.stats = {
'total_matches': 0,
'false_positives': 0,
'questionable': 0,
'good_matches': 0,
'company_accounts': 0,
'weak_evidence': 0,
'context_mismatch': 0,
}
def parse_log(self) -> List[Dict]:
"""Parse the verification log into structured data"""
with open(LOG_FILE, 'r') as f:
content = f.read()
entries = []
# Find all sections that start with TELEGRAM USER
pattern = r'TELEGRAM USER: ([^\(]+) \(ID: (\d+)\)(.*?)(?=TELEGRAM USER:|$)'
matches = re.finditer(pattern, content, re.DOTALL)
for match in matches:
tg_username = match.group(1).strip()
tg_id = int(match.group(2))
section = match.group(3)
# Extract TG profile
tg_bio_match = re.search(r'Bio: (.*?)(?:\n+TWITTER CANDIDATES)', section, re.DOTALL)
tg_bio = tg_bio_match.group(1).strip() if tg_bio_match else ''
# Extract username specificity
spec_match = re.search(r'username has specificity score ([\d.]+)', section)
specificity = float(spec_match.group(1)) if spec_match else 0.5
# Extract candidates
candidates = []
candidate_blocks = re.findall(
r'\[Candidate (\d+)\](.*?)(?=\[Candidate \d+\]|LLM RESPONSE:)',
section,
re.DOTALL
)
for idx, block in candidate_blocks:
tw_username = re.search(r'Twitter Username: @(\S+)', block)
tw_name = re.search(r'Twitter Display Name: (.+)', block)
tw_bio = re.search(r'Twitter Bio: (.*?)(?=\nLocation:)', block, re.DOTALL)
tw_followers = re.search(r'Followers: ([\d,]+)', block)
match_method = re.search(r'Match Method: (\S+)', block)
baseline_conf = re.search(r'Baseline Confidence: ([\d.]+)', block)
candidates.append({
'index': int(idx),
'twitter_username': tw_username.group(1) if tw_username else '',
'twitter_name': tw_name.group(1).strip() if tw_name else '',
'twitter_bio': tw_bio.group(1).strip() if tw_bio else '',
'twitter_followers': tw_followers.group(1) if tw_followers else '0',
'match_method': match_method.group(1) if match_method else '',
'baseline_confidence': float(baseline_conf.group(1)) if baseline_conf else 0.0,
})
# Extract LLM response (handle multiline JSON with nested structures)
llm_match = re.search(r'LLM RESPONSE:\s*-+\s*(\{.*)', section, re.DOTALL)
if llm_match:
try:
# Extract JSON - it should be everything after "LLM RESPONSE:" until end of section
json_text = llm_match.group(1)
# Find the JSON object (balanced braces)
brace_count = 0
json_end = 0
for i, char in enumerate(json_text):
if char == '{':
brace_count += 1
elif char == '}':
brace_count -= 1
if brace_count == 0:
json_end = i + 1
break
if json_end > 0:
json_str = json_text[:json_end]
llm_result = json.loads(json_str)
else:
llm_result = {'candidates': []}
except Exception:
# Malformed or partial JSON - fall back to an empty result
llm_result = {'candidates': []}
else:
llm_result = {'candidates': []}
entries.append({
'telegram_username': tg_username,
'telegram_id': tg_id,
'telegram_bio': tg_bio,
'username_specificity': specificity,
'candidates': candidates,
'llm_results': llm_result.get('candidates', [])
})
return entries
def is_company_account(self, bio: str, name: str) -> Tuple[bool, str]:
"""Detect if this is a company/product/team account"""
text = (bio + ' ' + name).lower()
for pattern in self.COMPANY_INDICATORS:
if re.search(pattern, text, re.IGNORECASE):
return True, f"Company pattern: '{pattern}'"
# Check if name equals bio description
if bio and len(bio.split()) < 20:
# Short bio describing what something "is"
if re.search(r'\bis\s+(a|an|the)\s+', bio, re.IGNORECASE):
return True, "Bio describes a product/service"
return False, ""
def is_personal_account(self, bio: str) -> bool:
"""Detect personal account indicators"""
for pattern in self.PERSONAL_INDICATORS:
if re.search(pattern, bio, re.IGNORECASE):
return True
return False
def has_crypto_context(self, bio: str) -> Tuple[bool, List[str]]:
"""Check if bio has crypto/web3 context"""
if not bio:
return False, []
bio_lower = bio.lower()
found_keywords = []
for keyword in self.CRYPTO_KEYWORDS:
if keyword in bio_lower:
found_keywords.append(keyword)
return len(found_keywords) > 0, found_keywords
def review_match(self, entry: Dict) -> Dict:
"""Critically review a single match"""
issues = []
severity = 'GOOD'
tg_username = entry['telegram_username']
tg_bio = entry['telegram_bio']
tg_has_crypto, tg_crypto_keywords = self.has_crypto_context(tg_bio)
# Review each LLM-approved match (confidence >= 0.5)
for llm_result in entry['llm_results']:
confidence = llm_result.get('confidence', 0)
if confidence < 0.5:
continue
self.stats['total_matches'] += 1
candidate_idx = llm_result.get('candidate_index', 0) - 1
if candidate_idx < 0 or candidate_idx >= len(entry['candidates']):
continue
candidate = entry['candidates'][candidate_idx]
tw_username = candidate['twitter_username']
tw_bio = candidate['twitter_bio']
tw_name = candidate['twitter_name']
match_method = candidate['match_method']
# Check 1: Company account
is_company, company_reason = self.is_company_account(tw_bio, tw_name)
if is_company:
issues.append({
'type': 'COMPANY_ACCOUNT',
'severity': 'HIGH',
'description': f"Twitter @{tw_username} appears to be a company/product account",
'evidence': company_reason,
'confidence': confidence
})
self.stats['company_accounts'] += 1
severity = 'FALSE_POSITIVE'
# Check 2: Context mismatch
tw_has_crypto, tw_crypto_keywords = self.has_crypto_context(tw_bio)
if tg_has_crypto and not tw_has_crypto:
issues.append({
'type': 'CONTEXT_MISMATCH',
'severity': 'MEDIUM',
'description': "TG bio has crypto context but TW bio doesn't",
'evidence': f"TG keywords: {tg_crypto_keywords}, TW keywords: none",
'confidence': confidence
})
self.stats['context_mismatch'] += 1
if severity == 'GOOD':
severity = 'QUESTIONABLE'
# Check 3: Empty bio with no strong evidence
if not tg_bio and not tw_bio and confidence > 0.8:
issues.append({
'type': 'WEAK_EVIDENCE',
'severity': 'MEDIUM',
'description': f"High confidence ({confidence}) with both bios empty",
'evidence': f"Only username match, no contextual verification",
'confidence': confidence
})
self.stats['weak_evidence'] += 1
if severity == 'GOOD':
severity = 'QUESTIONABLE'
# Check 4: Generic username with high confidence
if entry['username_specificity'] < 0.6 and confidence > 0.85:
issues.append({
'type': 'GENERIC_USERNAME',
'severity': 'LOW',
'description': f"Generic username ({entry['username_specificity']:.2f} specificity) with high confidence",
'evidence': f"Username: {tg_username}",
'confidence': confidence
})
if severity == 'GOOD':
severity = 'QUESTIONABLE'
# Check 5: Twitter bio mentions other accounts
if match_method == 'twitter_bio_has_telegram':
# Check if the telegram username appears as @mention (not the account itself)
mentions = re.findall(r'@(\w+)', tw_bio)
if tg_username.lower() not in [m.lower() for m in mentions]:
# The username is embedded in another handle
issues.append({
'type': 'SUBSTRING_MATCH',
'severity': 'HIGH',
'description': f"TG username found as substring in other accounts, not direct mention",
'evidence': f"TW bio: {tw_bio[:100]}",
'confidence': confidence
})
severity = 'FALSE_POSITIVE'  # counted once in the severity tally below
# Count severity
if severity == 'FALSE_POSITIVE':
self.stats['false_positives'] += 1
elif severity == 'QUESTIONABLE':
self.stats['questionable'] += 1
else:
self.stats['good_matches'] += 1
return {
'telegram_username': tg_username,
'telegram_id': entry['telegram_id'],
'severity': severity,
'issues': issues,
'entry': entry
}
def generate_report(self, reviews: List[Dict]):
"""Generate comprehensive review report"""
print()
print("=" * 100)
print("🔍 MATCH QUALITY REVIEW REPORT")
print("=" * 100)
print()
print("📊 STATISTICS:")
print(f" Total matches reviewed: {self.stats['total_matches']}")
print(f" ✅ Good matches: {self.stats['good_matches']} ({self.stats['good_matches']/max(self.stats['total_matches'],1)*100:.1f}%)")
print(f" ⚠️ Questionable: {self.stats['questionable']} ({self.stats['questionable']/max(self.stats['total_matches'],1)*100:.1f}%)")
print(f" ❌ False positives: {self.stats['false_positives']} ({self.stats['false_positives']/max(self.stats['total_matches'],1)*100:.1f}%)")
print()
print("🚨 ISSUE BREAKDOWN:")
print(f" Company accounts: {self.stats['company_accounts']}")
print(f" Context mismatches: {self.stats['context_mismatch']}")
print(f" Weak evidence: {self.stats['weak_evidence']}")
print()
# Show false positives
false_positives = [r for r in reviews if r['severity'] == 'FALSE_POSITIVE']
if false_positives:
print("=" * 100)
print("❌ FALSE POSITIVES:")
print("=" * 100)
for review in false_positives[:10]: # Show top 10
print()
print(f"TG @{review['telegram_username']} (ID: {review['telegram_id']})")
print(f"TG Bio: {review['entry']['telegram_bio'][:100]}")
for issue in review['issues']:
print(f" ❌ [{issue['severity']}] {issue['type']}: {issue['description']}")
print(f" Evidence: {issue['evidence'][:150]}")
print(f" LLM Confidence: {issue['confidence']:.2f}")
# Show questionable matches
questionable = [r for r in reviews if r['severity'] == 'QUESTIONABLE']
if questionable:
print()
print("=" * 100)
print("⚠️ QUESTIONABLE MATCHES:")
print("=" * 100)
for review in questionable[:10]: # Show top 10
print()
print(f"TG @{review['telegram_username']} (ID: {review['telegram_id']})")
for issue in review['issues']:
print(f" ⚠️ [{issue['severity']}] {issue['type']}: {issue['description']}")
print(f" Evidence: {issue['evidence'][:150]}")
print(f" LLM Confidence: {issue['confidence']:.2f}")
print()
print("=" * 100)
print("💡 RECOMMENDATIONS:")
print("=" * 100)
print()
if self.stats['company_accounts'] > 0:
print("1. Add company account detection to prompt:")
print(" - Check for product descriptions ('X is a platform for...')")
print(" - Look for 'official', 'team', 'hosted by' patterns")
print(" - Distinguish personal vs organizational accounts")
print()
if self.stats['context_mismatch'] > 0:
print("2. Strengthen context matching:")
print(" - Require crypto/web3 keywords in both profiles")
print(" - Lower confidence when contexts don't align")
print()
if self.stats['weak_evidence'] > 0:
print("3. Adjust confidence for weak evidence:")
print(" - Cap confidence at 0.70 when both bios are empty")
print(" - Require additional signals beyond username match")
print()
print("4. Fix 'twitter_bio_has_telegram' method:")
print(" - Only match direct @mentions, not substrings in other handles")
print(" - Example: @hipster should NOT match mentions of @HipsterHacker")
print()
def main():
reviewer = MatchReviewer()
print("📖 Parsing verification log...")
entries = reviewer.parse_log()
print(f"✅ Parsed {len(entries)} verification entries")
print()
print("🔍 Reviewing match quality...")
reviews = []
for entry in entries:
if entry['llm_results']: # Only review entries with matches
review = reviewer.review_match(entry)
reviews.append(review)
print(f"✅ Reviewed {len(reviews)} matches")
reviewer.generate_report(reviews)
if __name__ == "__main__":
main()
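The balanced-brace scan in parse_log (counting `{`/`}` until the depth returns to zero) can be lifted into a standalone helper. A minimal sketch; `extract_balanced_json` is a hypothetical name, and like the original scan it does not account for braces inside JSON string values, which the verification logs here do not contain:

```python
import json

def extract_balanced_json(text):
    """Parse the first balanced {...} object in text, or return None.

    Walks the string, incrementing on '{' and decrementing on '}';
    the object ends where the depth first returns to zero.
    """
    start = text.find('{')
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None
```

Applied to a log section, this recovers the `candidates` object even when trailing text follows the closing brace.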


@@ -0,0 +1,101 @@
-- Twitter-Telegram Matching Schema
-- Run this against your telegram_contacts database
-- Table: twitter_telegram_matches
-- Stores confirmed matches between Telegram and Twitter profiles
CREATE TABLE IF NOT EXISTS twitter_telegram_matches (
id SERIAL PRIMARY KEY,
-- Telegram side
telegram_user_id BIGINT NOT NULL REFERENCES contacts(user_id),
telegram_username VARCHAR(255),
telegram_name VARCHAR(255),
telegram_bio TEXT,
-- Twitter side
twitter_id VARCHAR(50) NOT NULL,
twitter_username VARCHAR(100) NOT NULL,
twitter_name VARCHAR(200),
twitter_bio TEXT,
twitter_location VARCHAR(200),
twitter_verified BOOLEAN,
twitter_blue_verified BOOLEAN,
twitter_followers_count INTEGER,
-- Matching metadata
match_method VARCHAR(50) NOT NULL, -- e.g. 'exact_bio_handle', 'exact_username', 'phash_match', 'fuzzy_name'
baseline_confidence FLOAT NOT NULL, -- Confidence before LLM (0-1)
llm_verdict VARCHAR(20) NOT NULL, -- 'CONFIDENT', 'MODERATE', 'UNSURE'
final_confidence FLOAT NOT NULL CHECK (final_confidence BETWEEN 0 AND 1),
-- Match details (JSON for flexibility)
match_details JSONB, -- {extracted_handles: [...], username_variation: 'xxx', fuzzy_score: 0.85}
-- LLM metadata
llm_tokens_used INTEGER,
llm_cost FLOAT,
-- Audit trail
matched_at TIMESTAMP DEFAULT NOW(),
needs_manual_review BOOLEAN DEFAULT FALSE,
verified_manually BOOLEAN DEFAULT FALSE,
manual_review_notes TEXT,
UNIQUE(telegram_user_id, twitter_id)
);
-- Indexes for performance
CREATE INDEX IF NOT EXISTS idx_ttm_telegram_user ON twitter_telegram_matches(telegram_user_id);
CREATE INDEX IF NOT EXISTS idx_ttm_twitter_id ON twitter_telegram_matches(twitter_id);
CREATE INDEX IF NOT EXISTS idx_ttm_twitter_username ON twitter_telegram_matches(LOWER(twitter_username));
CREATE INDEX IF NOT EXISTS idx_ttm_confidence ON twitter_telegram_matches(final_confidence DESC);
CREATE INDEX IF NOT EXISTS idx_ttm_verdict ON twitter_telegram_matches(llm_verdict);
CREATE INDEX IF NOT EXISTS idx_ttm_method ON twitter_telegram_matches(match_method);
CREATE INDEX IF NOT EXISTS idx_ttm_needs_review ON twitter_telegram_matches(needs_manual_review) WHERE needs_manual_review = TRUE;
-- Table: twitter_match_candidates (temporary staging)
-- Stores potential matches before LLM verification
CREATE TABLE IF NOT EXISTS twitter_match_candidates (
id SERIAL PRIMARY KEY,
telegram_user_id BIGINT NOT NULL REFERENCES contacts(user_id),
-- Twitter candidate info
twitter_id VARCHAR(50) NOT NULL,
twitter_username VARCHAR(100) NOT NULL,
twitter_name VARCHAR(200),
twitter_bio TEXT,
twitter_location VARCHAR(200),
twitter_verified BOOLEAN,
twitter_blue_verified BOOLEAN,
twitter_followers_count INTEGER,
-- Candidate scoring
candidate_rank INTEGER, -- 1 = best match, 2 = second best, etc.
match_method VARCHAR(50),
baseline_confidence FLOAT,
match_signals JSONB, -- {'handle_match': true, 'fuzzy_score': 0.85, ...}
-- LLM processing status
needs_llm_review BOOLEAN DEFAULT TRUE,
llm_processed BOOLEAN DEFAULT FALSE,
llm_verdict VARCHAR(20),
final_confidence FLOAT,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_tmc_telegram_user ON twitter_match_candidates(telegram_user_id);
CREATE INDEX IF NOT EXISTS idx_tmc_twitter_id ON twitter_match_candidates(twitter_id);
CREATE INDEX IF NOT EXISTS idx_tmc_needs_review ON twitter_match_candidates(llm_processed, needs_llm_review)
WHERE needs_llm_review = TRUE AND llm_processed = FALSE;
CREATE INDEX IF NOT EXISTS idx_tmc_rank ON twitter_match_candidates(telegram_user_id, candidate_rank);
-- Grant permissions (adjust as needed for your setup)
-- GRANT ALL PRIVILEGES ON twitter_telegram_matches TO your_user;
-- GRANT ALL PRIVILEGES ON twitter_match_candidates TO your_user;
-- GRANT USAGE, SELECT ON SEQUENCE twitter_telegram_matches_id_seq TO your_user;
-- GRANT USAGE, SELECT ON SEQUENCE twitter_match_candidates_id_seq TO your_user;
COMMENT ON TABLE twitter_telegram_matches IS 'Confirmed matches between Telegram and Twitter profiles';
COMMENT ON TABLE twitter_match_candidates IS 'Temporary staging for potential matches awaiting LLM verification';
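The `llm_verdict` and `needs_manual_review` columns are derived from the final confidence score. A small sketch of that mapping, mirroring the thresholds used when rows are written in verify_twitter_matches_v2.py (>= 0.8 CONFIDENT, >= 0.6 MODERATE, below that UNSURE; anything under 0.75 is flagged for review); `verdict_for` is a hypothetical helper name:

```python
def verdict_for(confidence):
    """Map a final confidence score to (llm_verdict, needs_manual_review)."""
    if confidence >= 0.8:
        verdict = 'CONFIDENT'
    elif confidence >= 0.6:
        verdict = 'MODERATE'
    else:
        verdict = 'UNSURE'
    # Matches below 0.75 are queued for manual review
    return verdict, confidence < 0.75
```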

verify_twitter_matches_v2.py Executable file

@@ -0,0 +1,791 @@
#!/usr/bin/env python3
"""
Twitter-Telegram Match Verifier V2 (Confidence-Based with Batch Evaluation)
Uses LLM with confidence scoring (0-1) and evaluates all candidates for a TG user together
"""
import sys
import asyncio
import json
from pathlib import Path
from typing import List, Dict
import psycopg2
from psycopg2.extras import DictCursor, RealDictCursor
import openai
from datetime import datetime
import re
# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
from db_config import SessionLocal
from models import Contact
# Twitter database connection
TWITTER_DB_CONFIG = {
'dbname': 'twitter_data',
'user': 'andrewjiang',
'host': 'localhost',
'port': 5432
}
# Checkpoint file
CHECKPOINT_FILE = Path(__file__).parent.parent / 'llm_verification_v2_checkpoint.json'
class CheckpointManager:
"""Manage checkpoints for resumable verification"""
def __init__(self, checkpoint_file):
self.checkpoint_file = checkpoint_file
self.data = self.load()
def load(self):
if self.checkpoint_file.exists():
with open(self.checkpoint_file, 'r') as f:
return json.load(f)
return {
'last_processed_telegram_id': None,
'processed_count': 0,
'total_matches_saved': 0,
'total_cost': 0.0,
'total_tokens': 0,
'started_at': None
}
def save(self):
self.data['last_updated_at'] = datetime.now().isoformat()
with open(self.checkpoint_file, 'w') as f:
json.dump(self.data, f, indent=2)
def update(self, telegram_user_id, matches_saved, tokens_used, cost):
self.data['last_processed_telegram_id'] = telegram_user_id
self.data['processed_count'] += 1
self.data['total_matches_saved'] += matches_saved
self.data['total_tokens'] += tokens_used
self.data['total_cost'] += cost
# Save every 20 telegram users
if self.data['processed_count'] % 20 == 0:
self.save()
def calculate_username_specificity(username: str) -> float:
"""
Calculate how specific/unique a username is (0-1)
Generic usernames get lower scores, unique ones get higher scores
"""
if not username:
return 0.5
username_lower = username.lower()
# Very generic patterns
generic_patterns = [
r'^admin\d*$', r'^user\d+$', r'^crypto\d*$', r'^nft\d*$',
r'^web3\d*$', r'^trader\d*$', r'^dev\d*$', r'^official\d*$',
r'^team\d*$', r'^support\d*$', r'^info\d*$'
]
for pattern in generic_patterns:
if re.match(pattern, username_lower):
return 0.3
# Length-based scoring
length = len(username)
if length < 4:
return 0.4
elif length < 6:
return 0.6
elif length < 8:
return 0.75
else:
return 0.9
class LLMVerifier:
"""Use GPT-5 Nano for confidence-based verification"""
def __init__(self):
self.client = openai.AsyncOpenAI()
self.model = "gpt-5-mini" # GPT-5 Mini - balanced performance and cost
# Cost calculation for gpt-5-mini
self.input_cost_per_1m = 0.25 # $0.25 per 1M input tokens
self.output_cost_per_1m = 2.00 # $2.00 per 1M output tokens
def build_batch_prompt(self, telegram_profile: Dict, candidates: List[Dict]) -> str:
"""Build LLM prompt for batch evaluation of all candidates"""
# Format Telegram profile
tg_bio = telegram_profile.get('bio') or 'none'  # key may exist with a None value
if tg_bio and len(tg_bio) > 300:
tg_bio = tg_bio[:300] + '...'
# Format chat context
chat_context = telegram_profile.get('chat_context', {})
chat_info = ""
if chat_context.get('chat_titles'):
chat_info = f"\nGroup Chats: {chat_context['chat_titles']}"
if chat_context.get('is_crypto_focused'):
chat_info += " [CRYPTO-FOCUSED GROUPS]"
prompt = f"""TELEGRAM PROFILE:
Username: @{telegram_profile.get('username') or 'none'}
Display Name: {telegram_profile.get('name', 'none')} (first_name + last_name combined)
Bio: {tg_bio}{chat_info}
TWITTER CANDIDATES (evaluate all together):
"""
for i, candidate in enumerate(candidates, 1):
tw_bio = candidate.get('twitter_bio') or 'none'
if tw_bio and len(tw_bio) > 250:
tw_bio = tw_bio[:250] + '...'
# Check if phash match info exists
phash_info = ""
if candidate['match_method'] == 'phash_match':
phash_distance = candidate.get('match_signals', {}).get('phash_distance', 'unknown')
phash_info = f"\n⭐ Phash Match: distance={phash_distance} (identical profile pictures!)"
prompt += f"""
[Candidate {i}]
Twitter Username: @{candidate['twitter_username']}
Twitter Display Name: {candidate.get('twitter_name', 'Unknown')}
Twitter Bio: {tw_bio}
Location: {candidate.get('twitter_location') or 'none'}
Verified: {candidate.get('twitter_verified', False)} (Blue: {candidate.get('twitter_blue_verified', False)})
Followers: {candidate.get('twitter_followers_count', 0):,}
Match Method: {candidate['match_method']}{phash_info}
Baseline Confidence: {candidate.get('baseline_confidence', 0):.2f}
"""
return prompt
async def verify_batch(self, telegram_profile: Dict, candidates: List[Dict], semaphore, log_file=None) -> Dict:
"""Verify all candidates for a single Telegram user"""
async with semaphore:
prompt = self.build_batch_prompt(telegram_profile, candidates)
if log_file:
log_file.write(f"\n{'=' * 100}\n")
log_file.write(f"TELEGRAM USER: {telegram_profile.get('username', 'N/A')} (ID: {telegram_profile['user_id']})\n")
log_file.write(f"{'=' * 100}\n\n")
system_prompt = """You are an expert at determining if two social media profiles belong to the same person.
# TASK
Determine confidence (0.0-1.0) that each Twitter candidate is the same person as the Telegram profile.
**CRITICAL: Evaluate ALL candidates together, not in isolation. Compare them against each other to identify which has the STRONGEST evidence.**
# SIGNAL STRENGTH GUIDE
Evaluate the FULL CONTEXT of all available signals together. Individual signals can be strong or weak, but the overall picture matters most.
## VERY STRONG SIGNALS (Can individually suggest high confidence)
- **Explicit bio mention**: TG bio says "x.com/username" or "Follow me @username"
- ⚠️ EXCEPTION: If that account is clearly a company/project (not personal), this is NOT definitive
- Example: TG bio "x.com/gems_gun" but @gems_gun is company account → Look for personal account like @lucahl0 "Building @gems_gun"
- **Unique username exact match**: Unusual/long username (like @kupermind, @schellinger_k) that matches exactly
- Generic usernames (@mike, @crypto123) don't qualify as "unique"
## STRONG SUPPORTING SIGNALS (Good indicators when combined)
Each of these helps build confidence, especially when multiple align:
- Full name match (after normalization: remove .eth, emojis, separators)
- Same profile picture (phash match)
- Aligned bio themes/context (both in crypto, both mention same projects/interests)
- Very similar username (not exact, but close: @kevin vs @k_kevin)
## WEAK SIGNALS (Need multiple strong signals to be meaningful)
- Generic name match only (Alex, Mike, David, John, Baz)
- Same general field but no specifics
- Partial username similarity with generic name
## RED FLAGS (Lower confidence significantly)
- Context mismatch: TG is crypto/tech, TW is chef/athlete/journalist
- Company account when looking for personal profile
- Famous person/celebrity (unless clear evidence it's actually them)
# CONFIDENCE BANDS
## 0.90-1.0: NEARLY CERTAIN
Very strong signal (bio mention of personal account OR unique username match) + supporting signals align
Examples:
- TG bio: "https://x.com/kupermind" + TW @kupermind personal account → 0.97
- TG @olliten + TW @olliten + same name + same pic → 0.98
## 0.70-0.89: LIKELY
Multiple strong supporting signals converge, or one very strong signal with some gap
Examples:
- Very similar username + name match + context: @schellinger → @k_schellinger "Kevin Schellinger" → 0.85
- Exact username on moderately unique name: @alexcrypto + "Alex Smith" crypto → 0.78
## 0.40-0.69: POSSIBLE
Some evidence but significant uncertainty
- Generic name + same field but no username/pic match
- Weak username similarity with generic name
- Profile pic match but name is very generic
## 0.10-0.39: UNLIKELY
Minimal evidence or contradictions
- Only generic name match (David, Alex, Mike)
- Context mismatch (crypto person vs chef)
## 0.0-0.09: EXTREMELY UNLIKELY
No meaningful evidence or clear contradiction
# COMPARATIVE EVALUATION PROCESS
**Step 1: Review ALL candidates together**
Don't score each in isolation. Look at the full set to understand which has the strongest evidence.
**Step 2: Identify the strongest signals present**
- Is there a bio mention? (Check if it's personal vs company account!)
- Is there a unique username match?
- Do multiple supporting signals converge for one candidate?
**Step 3: Apply differential scoring**
- The candidate with STRONGEST evidence should get meaningfully higher score
- If Candidate A has unique username + name match, and Candidate B only has generic name → A gets 0.85+, B gets 0.40 max
- If ALL candidates only have weak signals (generic name only) → ALL score 0.20-0.40
**Step 4: Sanity checks**
- Could this evidence match thousands of people? → Lower confidence
- Is there a context mismatch? → Max 0.50
- Is this a company account when we need personal? → Not the right match
**Key principle: Only ONE candidate can be "most likely" - differentiate clearly between them.**
# TECHNICAL NOTES
**Name Normalization**: Before comparing, remove .eth/.ton/.sol suffixes, emojis, "| company" separators, and ignore capitalization
**Profile Picture (phash)**: Phash match alone → MAX 0.70 (supporting signal). Use to break ties or add confidence to other signals.
# OUTPUT FORMAT
Return ONLY valid JSON (no markdown, no explanation):
{{
"candidates": [
{{
"candidate_index": 1,
"confidence": 0.85,
"reasoning": "Brief explanation"
}},
...
]
}}"""
try:
if log_file:
log_file.write("SYSTEM PROMPT:\n")
log_file.write("-" * 100 + "\n")
log_file.write(system_prompt + "\n\n")
log_file.write("USER PROMPT:\n")
log_file.write("-" * 100 + "\n")
log_file.write(prompt + "\n\n")
log_file.flush()
response = await self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
],
response_format={"type": "json_object"}
)
content = response.choices[0].message.content.strip()
if log_file:
log_file.write("LLM RESPONSE:\n")
log_file.write("-" * 100 + "\n")
log_file.write(content + "\n\n")
log_file.flush()
# Parse JSON
try:
result = json.loads(content)
except json.JSONDecodeError:
print(f" ⚠️ Failed to parse JSON response")
return {
'success': False,
'error': 'json_parse_error',
'tokens_used': response.usage.total_tokens,
'cost': self.calculate_cost(response.usage)
}
tokens_used = response.usage.total_tokens
cost = self.calculate_cost(response.usage)
return {
'success': True,
'results': result.get('candidates', []),
'tokens_used': tokens_used,
'cost': cost,
'error': None
}
except Exception as e:
print(f" ⚠️ LLM error: {str(e)[:100]}")
return {
'success': False,
'error': str(e),
'tokens_used': 0,
'cost': 0.0
}
def calculate_cost(self, usage) -> float:
"""Calculate cost for this API call"""
input_cost = (usage.prompt_tokens / 1_000_000) * self.input_cost_per_1m
output_cost = (usage.completion_tokens / 1_000_000) * self.output_cost_per_1m
return input_cost + output_cost
def get_telegram_users_with_candidates(telegram_conn, checkpoint_manager, limit=None):
"""Get list of telegram_user_ids that have unprocessed candidates"""
with telegram_conn.cursor() as cur:
query = """
SELECT DISTINCT telegram_user_id
FROM twitter_match_candidates
WHERE needs_llm_review = TRUE
AND llm_processed = FALSE
"""
if checkpoint_manager.data['last_processed_telegram_id']:
query += f" AND telegram_user_id > {int(checkpoint_manager.data['last_processed_telegram_id'])}"
query += " ORDER BY telegram_user_id"
if limit:
query += f" LIMIT {limit}"
cur.execute(query)
return [row[0] for row in cur.fetchall()]
def get_candidates_for_telegram_user(telegram_user_id: int, telegram_conn):
"""Get all candidates for a specific Telegram user"""
with telegram_conn.cursor(cursor_factory=RealDictCursor) as cur:
cur.execute("""
SELECT *
FROM twitter_match_candidates
WHERE telegram_user_id = %s
AND needs_llm_review = TRUE
AND llm_processed = FALSE
ORDER BY baseline_confidence DESC
""", (telegram_user_id,))
return [dict(row) for row in cur.fetchall()]
def get_user_chat_context(user_id: int, telegram_conn) -> Dict:
"""Get chat participation context for a user"""
with telegram_conn.cursor(cursor_factory=RealDictCursor) as cur:
cur.execute("""
SELECT
STRING_AGG(DISTINCT c.title, ' | ') FILTER (WHERE c.title IS NOT NULL) as chat_titles,
COUNT(DISTINCT cp.chat_id) as chat_count
FROM chat_participants cp
JOIN chats c ON cp.chat_id = c.chat_id
WHERE cp.user_id = %s
AND c.title IS NOT NULL
AND c.chat_type != 'private'
""", (user_id,))
result = cur.fetchone()
if result and result['chat_titles']:
# Check if chats indicate crypto/web3 interest
chat_titles_lower = result['chat_titles'].lower()
crypto_keywords = ['crypto', 'bitcoin', 'eth', 'defi', 'nft', 'dao', 'web3', 'blockchain',
'solana', 'near', 'avalanche', 'polygon', 'base', 'arbitrum', 'optimism',
'cosmos', 'builders', 'degen', 'lobster']
is_crypto_focused = any(keyword in chat_titles_lower for keyword in crypto_keywords)
return {
'chat_titles': result['chat_titles'],
'chat_count': result['chat_count'],
'is_crypto_focused': is_crypto_focused
}
return {'chat_titles': None, 'chat_count': 0, 'is_crypto_focused': False}
async def verify_telegram_user(telegram_user_id: int, verifier: LLMVerifier, checkpoint_manager: CheckpointManager,
telegram_db, telegram_conn, log_file=None) -> Dict:
"""Verify all candidates for a single Telegram user"""
# Get Telegram profile
telegram_profile = telegram_db.query(Contact).filter(
Contact.user_id == telegram_user_id
).first()
if not telegram_profile:
return {'matches_saved': 0, 'tokens': 0, 'cost': 0}
# Construct display name from first_name + last_name
display_name = (telegram_profile.first_name or '') + (' ' + telegram_profile.last_name if telegram_profile.last_name else '')
display_name = display_name.strip() or None
# Get chat participation context
chat_context = get_user_chat_context(telegram_user_id, telegram_conn)
telegram_dict = {
'user_id': telegram_profile.user_id,
'account_id': telegram_profile.account_id,
'username': telegram_profile.username,
'name': display_name,
'bio': telegram_profile.bio,
'chat_context': chat_context
}
# Get all candidates
candidates = get_candidates_for_telegram_user(telegram_user_id, telegram_conn)
if not candidates:
return {'matches_saved': 0, 'tokens': 0, 'cost': 0}
# Verify with LLM
semaphore = asyncio.Semaphore(1) # One at a time for now
llm_result = await verifier.verify_batch(telegram_dict, candidates, semaphore, log_file)
if not llm_result['success']:
# Mark as processed even on error so we don't retry infinitely
mark_candidates_processed([c['id'] for c in candidates], telegram_conn)
return {
'matches_saved': 0,
'tokens': llm_result['tokens_used'],
'cost': llm_result['cost']
}
# Save matches
matches_saved = save_verified_matches(
telegram_dict,
candidates,
llm_result['results'],
telegram_conn
)
# Mark candidates as processed
mark_candidates_processed([c['id'] for c in candidates], telegram_conn)
return {
'matches_saved': matches_saved,
'tokens': llm_result['tokens_used'],
'cost': llm_result['cost']
}
def save_verified_matches(telegram_profile: Dict, candidates: List[Dict], llm_results: List[Dict], telegram_conn):
"""Save verified matches with confidence scores"""
matches_to_save = []
# CRITICAL: Post-process to fix generic name false positives
tg_name = telegram_profile.get('name') or ''  # 'name' may be present but None
tg_username = telegram_profile.get('username') or ''
tg_bio = telegram_profile.get('bio') or ''
# Check if profile has generic characteristics
has_generic_name = len(tg_name) <= 7 # Short display name
has_generic_username = len(tg_username) <= 8 # Short username
has_empty_bio = not tg_bio or len(tg_bio) <= 20
is_generic_profile = has_empty_bio and (has_generic_name or has_generic_username)
for llm_result in llm_results:
candidate_idx = llm_result.get('candidate_index', 0) - 1 # Convert to 0-indexed
if candidate_idx < 0 or candidate_idx >= len(candidates):
continue
confidence = llm_result.get('confidence', 0)
reasoning = llm_result.get('reasoning', '')
# Only save if confidence >= 0.5 (moderate or higher)
if confidence < 0.5:
continue
candidate = candidates[candidate_idx]
# CRITICAL FIX: Cap confidence for generic profiles with weak match methods
# Weak match methods are those based purely on name/username containment
weak_match_methods = [
'display_name_containment',
'fuzzy_name',
'tg_username_in_twitter_name',
'twitter_username_in_tg_name'
]
if is_generic_profile and candidate['match_method'] in weak_match_methods:
# Cap at 0.70 unless it's a strong signal (phash, exact_username, exact_bio_handle)
if confidence > 0.70:
confidence = 0.70
reasoning += " [Confidence capped at 0.70: generic profile + weak match method]"
match_details = {
'match_method': candidate['match_method'],
'baseline_confidence': candidate['baseline_confidence'],
'llm_confidence': confidence,
'llm_reasoning': reasoning
}
matches_to_save.append((
telegram_profile['account_id'],
telegram_profile['user_id'],
telegram_profile.get('username'),
telegram_profile.get('name'),
telegram_profile.get('bio'),
candidate['twitter_id'],
candidate['twitter_username'],
candidate.get('twitter_name'),
candidate.get('twitter_bio'),
candidate.get('twitter_location'),
candidate.get('twitter_verified', False),
candidate.get('twitter_blue_verified', False),
candidate.get('twitter_followers_count', 0),
candidate['match_method'],
candidate['baseline_confidence'],
'CONFIDENT' if confidence >= 0.8 else 'MODERATE' if confidence >= 0.6 else 'UNSURE',
confidence,
json.dumps(match_details),
llm_result.get('tokens_used', 0),
0, # cost will be aggregated at the batch level
confidence < 0.75 # needs_manual_review
))
if matches_to_save:
with telegram_conn.cursor() as cur:
cur.executemany("""
INSERT INTO twitter_telegram_matches (
account_id,
telegram_user_id,
telegram_username,
telegram_name,
telegram_bio,
twitter_id,
twitter_username,
twitter_name,
twitter_bio,
twitter_location,
twitter_verified,
twitter_blue_verified,
twitter_followers_count,
match_method,
baseline_confidence,
llm_verdict,
final_confidence,
match_details,
llm_tokens_used,
llm_cost,
needs_manual_review
) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (telegram_user_id, twitter_id) DO UPDATE SET
llm_verdict = EXCLUDED.llm_verdict,
final_confidence = EXCLUDED.final_confidence,
matched_at = NOW()
""", matches_to_save)
telegram_conn.commit()
return len(matches_to_save)
def mark_candidates_processed(candidate_ids: List[int], telegram_conn):
"""Mark candidates as LLM processed"""
with telegram_conn.cursor() as cur:
cur.execute("""
UPDATE twitter_match_candidates
SET llm_processed = TRUE
WHERE id = ANY(%s)
""", (candidate_ids,))
telegram_conn.commit()

async def main():
    print()
    print("=" * 70)
    print("🤖 Twitter-Telegram Match Verifier V2 (Confidence-Based)")
    print("=" * 70)
    print()

    # Check arguments
    test_mode = '--test' in sys.argv
    verbose = '--verbose' in sys.argv
    limit = 100 if test_mode else None

    # Check for concurrency argument (--concurrent N)
    concurrent_requests = 10  # Default
    for i, arg in enumerate(sys.argv):
        if arg == '--concurrent' and i + 1 < len(sys.argv):
            try:
                concurrent_requests = int(sys.argv[i + 1])
            except ValueError:
                pass

    # Set up verbose logging
    log_file = None
    if verbose:
        log_file = open(Path(__file__).parent.parent / 'verification_v2_log.txt', 'w')
        log_file.write("=" * 100 + "\n")
        log_file.write("VERIFICATION V2 LOG\n")
        log_file.write("=" * 100 + "\n\n")

    if test_mode:
        print("🧪 TEST MODE: Processing first 100 Telegram users only")
        print()

    # Initialize checkpoint
    checkpoint_manager = CheckpointManager(CHECKPOINT_FILE)
    if checkpoint_manager.data['last_processed_telegram_id']:
        print(f"📍 Resuming from telegram_user_id {checkpoint_manager.data['last_processed_telegram_id']}")
        print(f"   Already processed: {checkpoint_manager.data['processed_count']:,}")
        print(f"   Cost so far: ${checkpoint_manager.data['total_cost']:.4f}")
        print()
    else:
        checkpoint_manager.data['started_at'] = datetime.now().isoformat()

    # Connect to databases
    print("📡 Connecting to databases...")
    telegram_db = SessionLocal()
    try:
        telegram_conn = psycopg2.connect(dbname='telegram_contacts', user='andrewjiang', host='localhost', port=5432)
        telegram_conn.autocommit = False
    except Exception as e:
        print(f"❌ Failed to connect to Telegram database: {e}")
        return False
    print("✅ Connected")
    print()

    try:
        # Load telegram users needing verification
        print("🔍 Loading telegram users with candidates...")
        telegram_user_ids = get_telegram_users_with_candidates(telegram_conn, checkpoint_manager, limit)
        if not telegram_user_ids:
            print("✅ No users to verify!")
            return True
        print(f"✅ Found {len(telegram_user_ids):,} telegram users to verify")
        print()

        # Estimate cost (rough, ~$0.003 per user)
        estimated_cost = len(telegram_user_ids) * 0.003
        print(f"💰 Estimated cost: ${estimated_cost:.4f}")
        print()

        # Initialize verifier
        verifier = LLMVerifier()
        print("🚀 Starting LLM verification...")
        print(f"⚡ Concurrent requests: {concurrent_requests}")
        print()

        # Configuration for parallel processing
        CONCURRENT_REQUESTS = concurrent_requests  # Max in-flight requests at a time
        BATCH_SIZE = 50  # Save checkpoint every 50 users

        # Process users in parallel batches
        total_users = len(telegram_user_ids)
        processed_count = 0

        for batch_start in range(0, total_users, BATCH_SIZE):
            batch_end = min(batch_start + BATCH_SIZE, total_users)
            batch = telegram_user_ids[batch_start:batch_end]
            print(f"📦 Processing batch {batch_start//BATCH_SIZE + 1}/{(total_users + BATCH_SIZE - 1)//BATCH_SIZE} ({len(batch)} users)...")

            # Create coroutines for concurrent processing (not started until awaited)
            tasks = []
            for telegram_user_id in batch:
                task = verify_telegram_user(
                    telegram_user_id,
                    verifier,
                    checkpoint_manager,
                    telegram_db,
                    telegram_conn,
                    log_file
                )
                tasks.append((telegram_user_id, task))

            # Process batch concurrently, capped by a semaphore
            semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)

            async def process_with_semaphore(user_id, task):
                async with semaphore:
                    return user_id, await task

            results = await asyncio.gather(
                *[process_with_semaphore(user_id, task) for user_id, task in tasks],
                return_exceptions=True
            )

            # Process results and update checkpoints (gather preserves input order)
            for i, result in enumerate(results):
                processed_count += 1
                user_id = batch[i]
                if isinstance(result, Exception):
                    print(f"[{processed_count}/{total_users}] ❌ User {user_id} failed: {result}")
                    continue
                user_id_result, verification_result = result
                print(f"[{processed_count}/{total_users}] ✅ User {user_id_result}: {verification_result['matches_saved']} matches | ${verification_result['cost']:.4f}")

                # Update checkpoint
                checkpoint_manager.update(
                    user_id_result,
                    verification_result['matches_saved'],
                    verification_result['tokens'],
                    verification_result['cost']
                )

            print(f"   Batch complete. Total processed: {processed_count}/{total_users}")
            print()

        # Final stats
        print()
        print("=" * 70)
        print("✅ VERIFICATION COMPLETE")
        print("=" * 70)
        print()
        print("📊 Statistics:")
        print(f"   Processed: {checkpoint_manager.data['processed_count']:,} telegram users")
        print(f"   💾 Saved matches: {checkpoint_manager.data['total_matches_saved']:,}")
        print()
        print("💰 Cost:")
        print(f"   Total tokens: {checkpoint_manager.data['total_tokens']:,}")
        print(f"   Total cost: ${checkpoint_manager.data['total_cost']:.4f}")
        print()

        # Clean up checkpoint after a successful full run
        CHECKPOINT_FILE.unlink(missing_ok=True)
        return True

    except Exception as e:
        print(f"❌ Error: {e}")
        import traceback
        traceback.print_exc()
        return False

    finally:
        if log_file:
            log_file.close()
        telegram_db.close()
        telegram_conn.close()
        checkpoint_manager.save()

if __name__ == "__main__":
    try:
        success = asyncio.run(main())
        sys.exit(0 if success else 1)
    except KeyboardInterrupt:
        print("\n\n⚠️  Interrupted by user")
        print("💾 Progress saved - you can resume by running this script again")
        sys.exit(1)