Initial commit: Twitter-Telegram Profile Matching System

This module provides comprehensive Twitter-to-Telegram profile matching
and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V6 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Andrew Jiang
Date: 2025-11-04 22:56:25 -08:00
Commit: 5319d4d868
10 changed files with 3394 additions and 0 deletions

.gitignore (vendored, new file, 47 lines)

@@ -0,0 +1,47 @@
# Environment variables
.env
.env.local
.env.*.local
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
build/
dist/
*.egg-info/
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# Logs
*.log
*.out
*.pid
# Database
*.db
*.sqlite
*.sqlite3
# Checkpoints and temp files
*_checkpoint.json
*.tmp
*.temp
# OS
.DS_Store
Thumbs.db
# Test outputs
test_output/
*.test

CHANGELOG.md (new file, 159 lines)

@@ -0,0 +1,159 @@
# ProfileMatching Changelog
## 2025-11-04 - Initial Module Creation & URL Resolution Integration
### Module Organization
- Created dedicated `ProfileMatching/` folder within UnifiedContacts
- Bundled all matching and verification scripts together
- Added comprehensive interactive `main.py` with menu system
- Added detailed `README.md` documentation
### Major Enhancement: URL Resolution Integration
**Problem**: Matches were missed when Twitter and Telegram usernames differ, even though the Twitter profile explicitly links to the Telegram account.
**Solution**: Integrated `url_resolution_queue` table data into candidate finding.
**Implementation**:
- Added new method `find_by_resolved_url()` to TwitterMatcher class (find_twitter_candidates.py:339-378)
- Queries Twitter profiles where bio URLs (t.co/xyz) resolve to t.me/{username}
- Integrated as Method 5b in candidate finding pipeline (find_twitter_candidates.py:554-574)
- Baseline confidence: 0.95 (very high - explicit link in bio)
**Impact**:
- 140+ potential new matches identified
- Captures matches like Twitter @Block_Flash → Telegram @bull_flash
- Especially valuable when usernames differ but user explicitly links profiles
**Example Match**:
```
Twitter: @Block_Flash (ID: 63429948)
Telegram: @bull_flash
Method: twitter_bio_url_resolves_to_telegram
Confidence: 0.95
Resolved URL: https://t.me/bull_flash
Original URL: https://t.co/dc3iztSG9B
```
### URL Resolution Data Source
The `url_resolution_queue` table contains:
- 16,133 Twitter profiles with resolved Telegram URLs
- Shortened URLs from Twitter bios (t.co/xyz)
- Resolved destinations (full URLs)
- Extracted Telegram handles (JSONB array)
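
The JSONB containment lookup behind this data source can be sketched as follows. The SQL mirrors the join described above (`u.id::text = urq.twitter_id` handles the VARCHAR-to-BIGINT cast), while the helper name is illustrative rather than the module's actual API:

```python
import json

# Match Twitter rows whose extracted telegram_handles JSONB array
# contains the given handle (the @> operator tests containment).
RESOLVED_URL_SQL = """
    SELECT u.id, u.username, urq.original_url, urq.resolved_url
    FROM url_resolution_queue urq
    JOIN public.users u ON u.id::text = urq.twitter_id
    WHERE urq.telegram_handles @> %s::jsonb
    LIMIT 3
"""

def containment_param(telegram_username: str) -> str:
    """Build the JSONB containment parameter: a one-element lowercase array."""
    return json.dumps([telegram_username.lower()])

# With a live cursor: cur.execute(RESOLVED_URL_SQL, (containment_param("bull_flash"),))
print(containment_param("Bull_Flash"))  # ["bull_flash"]
```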
### Files Modified
1. **find_twitter_candidates.py**
- Added `find_by_resolved_url()` method
- Integrated into `find_candidates_for_contact()` as Method 5b
- Fixed type casting for twitter_id (VARCHAR) to user.id (BIGINT) join
2. **main.py** (ProfileMatching module)
- Created comprehensive interactive menu
- Real-time statistics display
- Streamlined workflow for candidate finding and LLM verification
3. **README.md**
- Complete documentation of all 10 matching methods
- Usage examples and performance metrics
- Configuration and troubleshooting guides
4. **UnifiedContacts/main.py**
- Added option 15: "Open Profile Matching System (Interactive Menu)"
- Reorganized menu to separate ProfileMatching from individual Twitter steps
### Current Statistics (as of 2025-11-04)
**Candidates**:
- Users with candidates: 38,121
- Total candidates found: 253,117
- Processed by LLM: 253,085
- Pending verification: 32
**Verified Matches**:
- Users with matches: 25,662
- Total matches: 36,147
- Average confidence: 0.74
- High confidence (90%+): 12,031
- Medium confidence (80-89%): 1,452
- Low confidence (70-79%): 7,505
### LLM Verification (V6 Prompt)
Current prompt improvements:
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy (Very Strong, Strong Supporting, Weak, Red Flags)
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer
- Emphasis on evaluating ALL candidates together
### Performance Metrics
**Candidate Finding (Threaded)**:
- Speed: ~1.5 contacts/sec
- Time for 43K contacts: ~16-18 hours
- Workers: 8 (default)
**LLM Verification (Async)**:
- Speed: ~32 users/minute (100 concurrent requests)
- Cost: ~$0.003 per user (GPT-5-mini)
- Time for 43K users: ~23 hours
### Module Structure
```
ProfileMatching/
├── main.py # Interactive menu system
├── README.md # Complete documentation
├── CHANGELOG.md # This file
├── find_twitter_candidates.py # Core matching logic (10 methods)
├── find_twitter_candidates_threaded.py # Threaded implementation
├── verify_twitter_matches_v2.py # LLM verification (V6 prompt)
├── review_match_quality.py # Analysis tools
└── setup_twitter_matching_schema.sql # Database schema
```
### 10 Matching Methods Summary
1. **Phash Match** (0.95/0.88) - Profile picture similarity
2. **Exact Bio Handle** (0.95) - Twitter handle extracted from Telegram bio
3. **Bio URL Resolution** (0.95) ⭐ NEW - Shortened URL resolves to Telegram
4. **Twitter Bio Has Telegram** (0.92) - Twitter bio mentions Telegram username
5. **Display Name Containment** (0.92) - TG name in TW name
6. **Exact Username** (0.90) - Usernames match exactly
7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85) - Trigram similarity
10. **Username Variation** (0.80) - Generated username variations
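When several of these methods surface the same Twitter profile, the pipeline keeps one candidate per profile ID. A minimal sketch of that deduplication (the helper name and the keep-highest-confidence policy are illustrative; the real matcher tracks seen IDs as it runs the methods in order):

```python
from typing import Dict, List

def dedupe_candidates(candidates: List[Dict]) -> List[Dict]:
    """Keep the highest-baseline-confidence candidate per Twitter profile ID."""
    best: Dict[str, Dict] = {}
    for cand in candidates:
        twitter_id = cand["twitter_profile"]["id"]
        if (twitter_id not in best
                or cand["baseline_confidence"] > best[twitter_id]["baseline_confidence"]):
            best[twitter_id] = cand
    # Strongest signals first, mirroring the method hierarchy above
    return sorted(best.values(), key=lambda c: -c["baseline_confidence"])

cands = [
    {"twitter_profile": {"id": "1"}, "match_method": "fuzzy_name", "baseline_confidence": 0.75},
    {"twitter_profile": {"id": "1"}, "match_method": "exact_username", "baseline_confidence": 0.90},
    {"twitter_profile": {"id": "2"}, "match_method": "phash_match", "baseline_confidence": 0.95},
]
print([c["match_method"] for c in dedupe_candidates(cands)])  # ['phash_match', 'exact_username']
```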
### Testing
All changes tested with:
- Standalone method testing (find_by_resolved_url)
- Full integration testing (find_candidates_for_contact)
- Verified deduplication works correctly
- Confirmed matches with different usernames are captured
### Next Steps
Potential future enhancements:
- Add more matching methods (location, bio keywords, mutual connections)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats
- Additional URL resolution sources beyond Twitter bios
### Migration Notes
**For existing deployments**:
1. No database schema changes required
2. Existing `url_resolution_queue` table is used as-is
3. Scripts in `scripts/` folder remain unchanged and functional
4. New ProfileMatching module is additive, doesn't break existing workflows
**To use new features**:
1. Use ProfileMatching/main.py instead of individual scripts
2. Or run scripts directly from ProfileMatching folder
3. Or update import paths to use ProfileMatching module

QUICKSTART.md (new file, 218 lines)

@@ -0,0 +1,218 @@
# ProfileMatching Quick Start Guide
## 🚀 Fastest Way to Get Started
### Option 1: Interactive Menu (RECOMMENDED)
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```
This gives you an interactive menu with:
- Real-time statistics
- Guided workflow
- Easy access to all features
### Option 2: Launch from Main UnifiedContacts Menu
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts
python3.10 main.py
# Select option 15: "Open Profile Matching System"
```
## 📋 Typical Workflow
### Step 1: Find Candidates (First Time)
If you haven't found candidates yet, run:
```bash
# From ProfileMatching folder
python3.10 find_twitter_candidates_threaded.py --workers 8
# Or use interactive menu: Option 1
```
**Expected time**: ~16-18 hours for all 43K contacts
### Step 2: Verify with LLM
After candidates are found, verify them:
```bash
# From ProfileMatching folder
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
# Or use interactive menu: Option 3
```
**Expected time**: ~23 hours for all users
**Cost**: ~$130 for all users (GPT-5-mini at $0.003/user)
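The estimates above are straightforward products of the per-user rates:

```python
users = 43_000
cost_per_user = 0.003   # GPT-5-mini, per the estimate above
users_per_minute = 32   # at 100 concurrent requests

print(f"${users * cost_per_user:.0f}")               # $129
print(f"{users / users_per_minute / 60:.1f} hours")  # 22.4 hours
```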
### Step 3: Review Results
```bash
# From ProfileMatching folder
python3.10 review_match_quality.py
# Or use interactive menu: Option 5
```
## 🧪 Test Mode (Recommended Before Full Run)
Always test with a small batch first:
```bash
# Test with 50 users
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10
# Or use interactive menu: Option 4
```
This helps you:
- Verify the system is working correctly
- Check match quality before spending on full run
- Estimate costs and timing
## 📊 Check Current Status
At any time, you can check current progress:
```bash
# Launch interactive menu and select Option 6: "Show statistics only"
python3.10 main.py
# Press 6, then 0 to exit
```
Or query directly:
```bash
psql -d telegram_contacts -U andrewjiang -c "
SELECT
COUNT(DISTINCT telegram_user_id) as users_with_candidates,
COUNT(*) as total_candidates,
COUNT(*) FILTER (WHERE llm_processed = TRUE) as processed,
COUNT(*) FILTER (WHERE llm_processed = FALSE) as pending
FROM twitter_match_candidates;
"
```
## 🔄 Re-Running After Updates
If you've updated the LLM prompt or matching logic:
### Re-find Candidates (if matching logic changed)
```bash
# Delete old candidates
psql -d telegram_contacts -U andrewjiang -c "TRUNCATE twitter_match_candidates CASCADE;"
# Re-run candidate finding
python3.10 find_twitter_candidates_threaded.py --workers 8
```
### Re-verify with New Prompt (if only prompt changed)
```bash
# Reset LLM processing flag
psql -d telegram_contacts -U andrewjiang -c "UPDATE twitter_match_candidates SET llm_processed = FALSE;"
# Delete old matches
psql -d telegram_contacts -U andrewjiang -c "TRUNCATE twitter_telegram_matches;"
# Re-run verification
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```
## 🎯 Most Common Commands
### Find candidates for first 1000 contacts (testing)
```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```
### Verify matches for pending candidates
```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```
### Check match quality distribution
```bash
python3.10 review_match_quality.py
```
### Export matches to CSV (coming soon)
```bash
# Will be added in future update
```
## 💡 Pro Tips
1. **Always use threaded candidate finding** - It's 10-20x faster
2. **Use high concurrency for verification** - 100-200 concurrent requests for optimal speed
3. **Test first** - Always run with `--test --limit 50` before full runs
4. **Monitor costs** - Check OpenAI dashboard during verification
5. **Check the stats** - Use Option 6 in interactive menu to monitor progress
## 🐛 Troubleshooting
### "No candidates found"
- Check if Twitter database has data: `psql -d twitter_data -c "SELECT COUNT(*) FROM users;"`
- Check Telegram contacts: `psql -d telegram_contacts -c "SELECT COUNT(*) FROM contacts;"`
### "LLM verification is slow"
- Increase `--concurrent` parameter (try 150-200)
- Check OpenAI rate limits in dashboard
- Verify network connection
### "Too many low-quality matches"
- Review the V6 prompt in `verify_twitter_matches_v2.py`
- Run `review_match_quality.py` to analyze
- Consider adjusting confidence thresholds
### "Missing obvious matches"
- Check if candidate was found:
```sql
SELECT * FROM twitter_match_candidates WHERE telegram_user_id = YOUR_USER_ID;
```
- If found but not verified, check `llm_verdict` field for reasoning
- If not found at all, may need new matching method
## 📚 More Information
- See `README.md` for complete documentation
- See `CHANGELOG.md` for recent updates
- See individual script files for command-line options
## 🆘 Need Help?
Common issues and solutions:
| Issue | Solution |
|-------|----------|
| Import errors | Make sure you're using python3.10 |
| Database connection errors | Check PostgreSQL is running: `pg_isready` |
| OpenAI API errors | Verify API key in `.env` file |
| Out of memory | Reduce concurrent requests or use batching |
## 🎓 Understanding the Output
### Candidate Finding Output
```
Processing contact 1000/43000 (2.3%)
Found 6 candidates for @username
• exact_username: @username (0.90)
• fuzzy_name: @similar_name (0.75)
```
### LLM Verification Output
```
[Progress] 500/1000 users (50.0%) | 125 matches | ~$1.50 | 25.0 users/min
```
### Match Quality Review
```
Total users with matches: 25,662
Total matches: 36,147
Average confidence: 0.74
Confidence Distribution:
90%+: 12,031 matches (HIGH)
80-89%: 1,452 matches (MEDIUM)
70-79%: 7,505 matches (LOW)
```

README.md (new file, 207 lines)

@@ -0,0 +1,207 @@
# Twitter-Telegram Profile Matching System
A comprehensive system for finding and verifying Twitter-Telegram profile matches using multiple matching methods and LLM-based verification.
## Overview
This system operates in two main steps:
1. **Candidate Finding**: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods
2. **LLM Verification**: Uses GPT to evaluate candidates and assign confidence scores (0.70-1.0)
## Quick Start
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```
## Matching Methods
The system uses 10 different methods to find Twitter candidates:
### High Confidence Methods (0.90-0.95)
1. **Phash Match** (0.95 for exact, 0.88 for distance=1)
- Compares profile picture hashes
- Pre-computed in `telegram_twitter_phash_matches` table
2. **Exact Bio Handle** (0.95)
- Extracts Twitter handles from Telegram bio
- Patterns: `@username`, `twitter.com/username`, `x.com/username`
3. **Bio URL Resolution** (0.95) ⭐ NEW
- Twitter bio contains shortened URL (t.co/xyz) that resolves to `t.me/username`
- Queries `url_resolution_queue` table
- Captures matches even when usernames differ
4. **Twitter Bio Has Telegram** (0.92)
- Reverse lookup: Twitter bio mentions Telegram username
- Searches for `@username`, `t.me/username`, `telegram.me/username`
5. **Display Name Containment** (0.92)
- Telegram name contained within Twitter display name
6. **Exact Username** (0.90)
- Telegram username exactly matches Twitter username
### Medium Confidence Methods (0.80-0.88)
7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85)
- PostgreSQL trigram similarity with 0.65 threshold
10. **Username Variation** (0.80)
- Generates variations (remove underscores, flip numbers, etc.)
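Method 9's trigram similarity comes from PostgreSQL's `pg_trgm` extension. A simplified pure-Python model (approximating pg_trgm's word padding, not the extension itself) shows how the 0.65 threshold behaves:

```python
def trigrams(text: str) -> set:
    """Approximate pg_trgm trigram extraction: lowercase, pad each word."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "  # pg_trgm pads with 2 leading, 1 trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    """Jaccard similarity over trigram sets, like pg_trgm's similarity()."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(similarity("alice", "alice"))          # 1.0
print(similarity("alice", "alicia") > 0.3)   # True
```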
## LLM Verification
The system uses GPT-5-mini with a sophisticated V6 prompt that:
- Evaluates ALL candidates together (comparative evaluation)
- Applies differential scoring (only one can be "most likely")
- Distinguishes between personal and company accounts
- Considers signal strength holistically
- Only saves matches with 70%+ confidence
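The verification loop can be sketched with a stubbed LLM call. Everything here is illustrative: the real script batches up to 100 concurrent GPT-5-mini requests, while this stub only shows the per-user comparative call and the 70% confidence floor:

```python
import asyncio
from typing import Dict, List

CONFIDENCE_FLOOR = 0.70  # matches are only saved at 70%+ confidence

async def evaluate_all_candidates(user: Dict, candidates: List[Dict]) -> List[Dict]:
    """Stand-in for the GPT call: the real script sends ALL of one user's
    candidates in a single prompt so the model can compare them."""
    await asyncio.sleep(0)  # placeholder for the network round-trip
    # Illustrative verdict: echo baseline confidence as the final confidence
    return [
        {"twitter_id": c["twitter_id"], "final_confidence": c["baseline_confidence"]}
        for c in candidates
    ]

async def verify_user(user: Dict, candidates: List[Dict]) -> List[Dict]:
    verdicts = await evaluate_all_candidates(user, candidates)
    return [v for v in verdicts if v["final_confidence"] >= CONFIDENCE_FLOOR]

async def main() -> List[Dict]:
    user = {"telegram_user_id": 1}
    candidates = [
        {"twitter_id": "a", "baseline_confidence": 0.95},
        {"twitter_id": "b", "baseline_confidence": 0.65},
    ]
    return await verify_user(user, candidates)

print(asyncio.run(main()))  # [{'twitter_id': 'a', 'final_confidence': 0.95}]
```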
## Files
### Core Scripts
- `main.py` - Interactive menu for running the system
- `find_twitter_candidates.py` - Core matching logic (TwitterMatcher class)
- `find_twitter_candidates_threaded.py` - Threaded implementation (RECOMMENDED)
- `verify_twitter_matches_v2.py` - LLM verification with async (RECOMMENDED)
- `review_match_quality.py` - Analyze match quality and statistics
### Database Schema
- `setup_twitter_matching_schema.sql` - Database tables and indexes
## Database Tables
### `twitter_match_candidates`
Stores all potential matches found by the matching methods.
**Key fields:**
- `telegram_user_id` - Telegram contact user ID
- `twitter_id` - Twitter profile ID
- `match_method` - Which method found this candidate
- `baseline_confidence` - Initial confidence (0.0-1.0)
- `match_signals` - JSON with match details
- `llm_processed` - Whether LLM has evaluated this candidate
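A candidate row for this table can be assembled as follows. This is a hedged sketch: the column list is inferred from the key fields above (the real table may have more), and the helper name is illustrative:

```python
import json
from typing import Dict, Tuple

# Columns taken from the key fields listed above; the real table may have more.
INSERT_SQL = """
    INSERT INTO twitter_match_candidates
        (telegram_user_id, twitter_id, match_method,
         baseline_confidence, match_signals, llm_processed)
    VALUES %s
"""

def candidate_row(telegram_user_id: int, candidate: Dict) -> Tuple:
    """Flatten one candidate dict into an insertable tuple."""
    return (
        telegram_user_id,
        candidate["twitter_profile"]["id"],
        candidate["match_method"],
        candidate["baseline_confidence"],
        json.dumps(candidate["match_details"]),  # match_signals JSON
        False,  # not yet evaluated by the LLM
    )

row = candidate_row(42, {
    "twitter_profile": {"id": "63429948"},
    "match_method": "twitter_bio_url_resolves_to_telegram",
    "baseline_confidence": 0.95,
    "match_details": {"resolved_url": "https://t.me/bull_flash"},
})
print(row[2])  # twitter_bio_url_resolves_to_telegram
```

With psycopg2, `execute_values(cur, INSERT_SQL, rows)` would batch many such tuples in one round-trip, which is why the matcher imports it.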
### `twitter_telegram_matches`
Stores verified matches (70%+ confidence from LLM).
**Key fields:**
- `telegram_user_id` - Telegram contact
- `twitter_id` - Matched Twitter profile
- `final_confidence` - LLM-assigned confidence (0.70-1.0)
- `llm_verdict` - LLM reasoning
- `match_method` - Original matching method
- `matched_at` - Timestamp
### `url_resolution_queue`
Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).
**Key fields:**
- `twitter_id` - Twitter profile ID
- `original_url` - Shortened URL (e.g., t.co/abc)
- `resolved_url` - Full URL (e.g., https://t.me/username)
- `telegram_handles` - Extracted Telegram handles (JSONB array)
## Usage Examples
### Find Candidates for All Contacts (Threaded)
```bash
python3.10 find_twitter_candidates_threaded.py --workers 8
```
### Find Candidates for First 1000 Contacts
```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```
### Verify Matches with LLM (100 concurrent requests)
```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```
### Test Mode (50 users, 10 concurrent)
```bash
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10
```
### Review Match Quality
```bash
python3.10 review_match_quality.py
```
## Performance
### Candidate Finding (Threaded)
- **Speed**: ~1.5 contacts/sec
- **Time for 43K contacts**: ~16-18 hours
- **Workers**: 8 (default, configurable)
### LLM Verification (Async)
- **Speed**: ~32 users/minute with 100 concurrent requests
- **Cost**: ~$0.003 per user (GPT-5-mini)
- **Time for 43K users**: ~23 hours
## Recent Improvements
### V6 Prompt (Latest)
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer
### URL Resolution Integration
- Added Method 5b: Bio URL resolution
- Captures 140+ additional matches
- Especially valuable when usernames differ
- 0.95 baseline confidence (very high)
## Configuration
Environment variables (in `/Users/andrewjiang/Bao/TimeToLockIn/Profile/.env`):
```
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini
```
Database connections:
- `telegram_contacts` - Telegram contact data
- `twitter_data` - Twitter profile data
## Tips
1. **Always run threaded candidate finding** - 10-20x faster than single-threaded
2. **Use high concurrency for LLM verification** - 100+ concurrent requests for optimal speed
3. **Monitor costs** - Check OpenAI usage during verification
4. **Review match quality periodically** - Use `review_match_quality.py` to analyze results
5. **Test first** - Use `--test --limit 50` flags before full runs
## Troubleshooting
### LLM verification is slow
- Increase `--concurrent` parameter (try 100-200)
- Check OpenAI rate limits (1,000 RPM for Tier 1)
### Many low-quality matches
- Review and adjust V6 prompt in `verify_twitter_matches_v2.py`
- Check `review_match_quality.py` for insights
### Missing obvious matches
- Check if candidate was found: Query `twitter_match_candidates`
- If not found, may need new matching method
- If found but not verified, check LLM reasoning in `llm_verdict`
## Future Enhancements
- Add more matching methods (location, bio keywords, etc.)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats

find_twitter_candidates.py (executable, new file, 851 lines)

@@ -0,0 +1,851 @@
#!/usr/bin/env python3
"""
Twitter-Telegram Candidate Finder
Finds potential Twitter matches for Telegram contacts using:
1. Handle extraction from bios
2. Username variation generation
3. Fuzzy name matching
"""
import sys
import re
import json
from pathlib import Path
from typing import List, Dict, Set, Tuple
import psycopg2
from psycopg2.extras import DictCursor, execute_values
# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
from db_config import SessionLocal
from models import Contact
# Twitter database connection (adjust as needed)
TWITTER_DB_CONFIG = {
'dbname': 'twitter_data',
'user': 'andrewjiang', # Adjust to your setup
'host': 'localhost',
'port': 5432
}
class HandleExtractor:
"""Extract Twitter handles from text"""
@staticmethod
def extract_handles(text: str) -> List[str]:
"""Extract Twitter handles from bio/text"""
if not text:
return []
handles = set()
# Pattern 1: @username
pattern1 = r'@([a-zA-Z0-9_]{4,15})'
handles.update(re.findall(pattern1, text))
# Pattern 2: twitter.com/username or x.com/username
pattern2 = r'(?:twitter\.com|x\.com)/([a-zA-Z0-9_]{4,15})'
handles.update(re.findall(pattern2, text, re.IGNORECASE))
# Pattern 3: Clean standalone handles (risky, be conservative)
# Only if text is short and looks like a handle
if len(text) < 30 and text.count('@') == 1:
clean = text.strip('@').strip()
if re.match(r'^[a-zA-Z0-9_]{4,15}$', clean):
handles.add(clean)
return [h.lower() for h in handles if len(h) >= 4]
class UsernameVariationGenerator:
"""Generate Twitter handle variations from Telegram usernames"""
@staticmethod
def generate_variations(telegram_username: str) -> List[str]:
"""
Generate possible Twitter handle variations
Examples:
- alice_0x → [alice_0x, 0xalice, 0x_alice, alice0x]
- trader_69 → [trader_69, 69trader, 69_trader, trader69]
"""
if not telegram_username:
return []
variations = [telegram_username.lower()]
# Remove underscores
no_underscore = telegram_username.replace('_', '')
if no_underscore != telegram_username and len(no_underscore) >= 4:
variations.append(no_underscore.lower())
# Handle "0x" patterns (common in crypto)
if '0x' in telegram_username.lower():
# If 0x at end, try moving to front
if telegram_username.lower().endswith('_0x'):
base = telegram_username[:-3]
variations.extend([
f"0x{base}".lower(),
f"0x_{base}".lower()
])
elif telegram_username.lower().endswith('0x'):
base = telegram_username[:-2]
variations.extend([
f"0x{base}".lower(),
f"0x_{base}".lower()
])
# Try without underscore
no_under = telegram_username.replace('_', '')
if '0x' in no_under.lower() and no_under.lower() not in variations:
variations.append(no_under.lower())
# Handle trailing numbers (alice_69 → 69alice)
match = re.match(r'^([a-z_]+?)_?(\d+)$', telegram_username, re.IGNORECASE)
if match:
prefix, number = match.groups()
prefix = prefix.rstrip('_')
if len(f"{number}{prefix}") >= 4:
variations.extend([
f"{number}{prefix}".lower(),
f"{number}_{prefix}".lower()
])
# Handle leading numbers (69_alice → alice69)
match = re.match(r'^(\d+)_?([a-z_]+)$', telegram_username, re.IGNORECASE)
if match:
number, suffix = match.groups()
suffix = suffix.lstrip('_')
if len(f"{suffix}{number}") >= 4:
variations.extend([
f"{suffix}{number}".lower(),
f"{suffix}_{number}".lower()
])
# Single character removals (banteg → bantg, trader → trade)
# This catches shortened versions of usernames
base = telegram_username.lower()
if len(base) > 4: # Only if result would be at least 4 chars
for i in range(len(base)):
variation = base[:i] + base[i+1:]
if len(variation) >= 4:
variations.append(variation)
# Deduplicate and validate (4-15 chars for Twitter)
valid = []
for v in set(variations):
v_clean = v.strip('_')
if 4 <= len(v_clean) <= 15:
valid.append(v_clean)
return list(set(valid))
class TwitterMatcher:
"""Find Twitter profiles matching Telegram contacts"""
def __init__(self, twitter_conn, telegram_conn=None):
self.twitter_conn = twitter_conn
self.telegram_conn = telegram_conn # Needed for phash lookups
self.handle_extractor = HandleExtractor()
self.variation_generator = UsernameVariationGenerator()
def find_by_handle(self, handle: str) -> Dict:
"""Lookup Twitter profile by exact handle"""
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE LOWER(username) = %s
LIMIT 1
""", (handle.lower(),))
result = cur.fetchone()
return dict(result) if result else None
def find_by_fuzzy_name(self, telegram_name: str, limit=3) -> List[Dict]:
"""Find Twitter profiles with similar names using fuzzy matching"""
if not telegram_name or len(telegram_name) < 3:
return []
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
# Use parameterized query with proper escaping for % operator
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at,
similarity(name, %(name)s) AS name_score
FROM public.users
WHERE name %% %(name)s -- %% for similarity operator (escaped in string)
AND similarity(name, %(name)s) > 0.65 -- Increased threshold from 0.5 to reduce noise
ORDER BY name_score DESC
LIMIT %(limit)s
""", {'name': telegram_name, 'limit': limit})
return [dict(row) for row in cur.fetchall()]
def find_by_display_name_containment(self, telegram_name: str, limit=5) -> List[Dict]:
"""Find Twitter profiles where TG display name is contained in TW display name"""
if not telegram_name or len(telegram_name) < 3:
return []
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
# Direct containment search (case-insensitive)
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE name ILIKE %s -- TG name contained in TW name
AND LENGTH(name) >= %s -- TW name must be at least as long as TG name
ORDER BY followers_count DESC -- Prioritize by follower count
LIMIT %s
""", (f'%{telegram_name}%', len(telegram_name), limit))
return [dict(row) for row in cur.fetchall()]
def find_by_phash_match(self, telegram_user_id: int) -> List[Dict]:
"""Find Twitter profiles with matching profile picture phash (distance 0-1 only)"""
if not self.telegram_conn:
return []
with self.telegram_conn.cursor(cursor_factory=DictCursor) as cur:
# Query pre-computed phash matches (distance 0-1 for high confidence)
cur.execute("""
SELECT
m.twitter_user_id,
m.twitter_username,
m.hamming_distance,
m.telegram_phash,
m.twitter_phash
FROM telegram_twitter_phash_matches m
WHERE m.telegram_user_id = %s
AND m.hamming_distance <= 1 -- Only exact and distance-1 matches
ORDER BY m.hamming_distance ASC
LIMIT 5
""", (telegram_user_id,))
phash_matches = cur.fetchall()
# Commit to close transaction
self.telegram_conn.commit()
if not phash_matches:
return []
# Fetch full Twitter profile data for matched users
twitter_ids = [m['twitter_user_id'] for m in phash_matches]
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE id = ANY(%s)
""", (twitter_ids,))
twitter_profiles = [dict(row) for row in cur.fetchall()]
# Enrich with phash match details
twitter_profile_map = {str(p['id']): p for p in twitter_profiles}
results = []
for match in phash_matches:
tw_id = match['twitter_user_id']
if tw_id in twitter_profile_map:
profile = twitter_profile_map[tw_id].copy()
profile['phash_distance'] = match['hamming_distance']
profile['telegram_phash'] = match['telegram_phash']
profile['twitter_phash'] = match['twitter_phash']
results.append(profile)
return results
def find_by_telegram_in_twitter_bio(self, telegram_username: str, limit=3) -> List[Dict]:
"""Find Twitter profiles that mention this Telegram username in their bio (exact @mention only)"""
if not telegram_username or len(telegram_username) < 4:
return []
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
# Search for various Telegram handle patterns in Twitter bios
# Use regex with word boundaries to avoid substring matches in other handles
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE description IS NOT NULL
AND (
description ~* %s -- @username with word boundary (not part of longer handle)
OR LOWER(description) LIKE %s -- t.me/username
OR LOWER(description) LIKE %s -- telegram.me/username
)
ORDER BY followers_count DESC
LIMIT %s
""", (
r'@' + telegram_username + r'(\s|$|[^a-zA-Z0-9_])', # Word boundary: space, end, or non-alphanumeric
f'%t.me/{telegram_username.lower()}%',
f'%telegram.me/{telegram_username.lower()}%',
limit
))
return [dict(row) for row in cur.fetchall()]
def find_by_resolved_url(self, telegram_username: str) -> List[Dict]:
"""Find Twitter profiles whose bio URL resolves to this Telegram username"""
if not telegram_username or len(telegram_username) < 4:
return []
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
# Query url_resolution_queue for Twitter profiles with resolved Telegram links
cur.execute("""
SELECT DISTINCT
u.id,
u.username,
u.name,
u.description,
u.location,
u.verified,
u.is_blue_verified,
u.followers_count,
u.following_count,
u.created_at,
urq.resolved_url,
urq.original_url
FROM url_resolution_queue urq
JOIN public.users u ON u.id::text = urq.twitter_id
WHERE urq.telegram_handles IS NOT NULL
AND urq.telegram_handles @> %s::jsonb
ORDER BY u.followers_count DESC
LIMIT 3
""", (f'["{telegram_username.lower()}"]',))
results = cur.fetchall()
# Enrich with URL details
enriched = []
for row in results:
profile = dict(row)
profile['resolved_telegram_url'] = profile.pop('resolved_url')
profile['original_url'] = profile.pop('original_url')
enriched.append(profile)
return enriched
def find_by_username_in_display_name(self, search_term: str, is_telegram: bool, limit: int = 5) -> List[Dict]:
"""
Find Twitter profiles where display name contains username pattern
If is_telegram=True: Search for TG username in Twitter display names
If is_telegram=False: reverse lookup - find Twitter profiles whose username appears within the given TG display name
"""
with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
if is_telegram:
# TG username in Twitter display name
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE name IS NOT NULL
AND LOWER(name) LIKE %s
ORDER BY followers_count DESC
LIMIT %s
""", (f'%{search_term.lower()}%', limit))
else:
# Twitter username in TG display name - search for Twitter profiles whose username appears in the search term
cur.execute("""
SELECT
id,
username,
name,
description,
location,
verified,
is_blue_verified,
followers_count,
following_count,
created_at
FROM public.users
WHERE username IS NOT NULL
AND LOWER(%s) LIKE '%%' || LOWER(username) || '%%'
ORDER BY followers_count DESC
LIMIT %s
""", (search_term, limit))
return [dict(row) for row in cur.fetchall()]
def get_display_name(self, contact: Contact) -> str:
"""Get display name with fallback to first_name + last_name"""
if contact.display_name:
return contact.display_name
# Fallback to first_name + last_name
parts = [contact.first_name, contact.last_name]
return ' '.join(p for p in parts if p).strip()
def find_candidates_for_contact(self, contact: Contact) -> List[Dict]:
"""
Find all Twitter candidates for a single Telegram contact
Returns list of candidates with:
{
'twitter_profile': {...},
'match_method': 'exact_bio_handle' | 'exact_username' | 'username_variation' | 'fuzzy_name',
'baseline_confidence': 0.0-1.0,
'match_details': {...}
}
"""
candidates = []
seen_twitter_ids = set()
# Get display name with fallback
display_name = self.get_display_name(contact)
# Method 1: Phash matching (profile picture similarity)
phash_matches = self.find_by_phash_match(contact.user_id)
for twitter_profile in phash_matches:
phash_distance = twitter_profile.pop('phash_distance')
telegram_phash = twitter_profile.pop('telegram_phash')
twitter_phash = twitter_profile.pop('twitter_phash')
# Baseline confidence based on phash distance
if phash_distance == 0:
baseline_confidence = 0.95 # Exact match - VERY strong signal
elif phash_distance == 1:
baseline_confidence = 0.88 # 1-bit difference - strong signal
else:
continue # Skip distance > 1
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'phash_match',
'baseline_confidence': baseline_confidence,
'match_details': {
'phash_distance': phash_distance,
'telegram_phash': telegram_phash,
'twitter_phash': twitter_phash
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 2: Extract handles from bio
if contact.bio:
bio_handles = self.handle_extractor.extract_handles(contact.bio)
for handle in bio_handles:
twitter_profile = self.find_by_handle(handle)
if twitter_profile and twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'exact_bio_handle',
'baseline_confidence': 0.95,
'match_details': {
'extracted_handle': handle,
'from_bio': True
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 3: Exact username match
if contact.username:
twitter_profile = self.find_by_handle(contact.username)
if twitter_profile and twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'exact_username',
'baseline_confidence': 0.90,
'match_details': {
'telegram_username': contact.username,
'twitter_username': twitter_profile['username']
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 4: Username variations
if contact.username:
variations = self.variation_generator.generate_variations(contact.username)
for variation in variations:
if variation == contact.username.lower():
continue  # Exact username already checked in Method 3
twitter_profile = self.find_by_handle(variation)
if twitter_profile and twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'username_variation',
'baseline_confidence': 0.80,
'match_details': {
'telegram_username': contact.username,
'username_variation': variation,
'twitter_username': twitter_profile['username']
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 5: Twitter bio contains Telegram username (reverse lookup)
if contact.username:
reverse_matches = self.find_by_telegram_in_twitter_bio(contact.username, limit=3)
for twitter_profile in reverse_matches:
if twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'twitter_bio_has_telegram',
'baseline_confidence': 0.92, # Very high confidence - they explicitly mention their Telegram
'match_details': {
'telegram_username': contact.username,
'twitter_username': twitter_profile['username'],
'found_in_twitter_bio': True
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 5b: Twitter bio URL resolves to Telegram username (via url_resolution_queue)
if contact.username:
url_resolved_matches = self.find_by_resolved_url(contact.username)
for twitter_profile in url_resolved_matches:
if twitter_profile['id'] not in seen_twitter_ids:
resolved_url = twitter_profile.pop('resolved_telegram_url', None)
original_url = twitter_profile.pop('original_url', None)
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'twitter_bio_url_resolves_to_telegram',
'baseline_confidence': 0.95, # VERY high confidence - explicit URL link in bio
'match_details': {
'telegram_username': contact.username,
'twitter_username': twitter_profile['username'],
'resolved_url': resolved_url,
'original_url': original_url,
'found_via_url_resolution': True
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 6: Display name containment (TG name in TW name)
if display_name:
containment_matches = self.find_by_display_name_containment(display_name, limit=5)
for twitter_profile in containment_matches:
if twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'display_name_containment',
'baseline_confidence': 0.92, # High confidence for exact name containment
'match_details': {
'telegram_name': display_name,
'twitter_name': twitter_profile['name'],
'match_type': 'tg_name_contained_in_tw_name'
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 7: Fuzzy name match (always run to find additional candidates)
if display_name:
fuzzy_matches = self.find_by_fuzzy_name(display_name, limit=5)
for i, twitter_profile in enumerate(fuzzy_matches):
name_score = twitter_profile.pop('name_score')
# Calculate baseline confidence from name similarity
baseline_confidence = min(0.85, name_score) # Cap at 0.85 for fuzzy matches
if twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'fuzzy_name',
'baseline_confidence': baseline_confidence,
'match_details': {
'telegram_name': display_name,
'twitter_name': twitter_profile['name'],
'fuzzy_score': name_score,
'candidate_rank': i + 1
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 8: TG username in Twitter display name
if contact.username:
matches = self.find_by_username_in_display_name(contact.username, is_telegram=True, limit=3)
for twitter_profile in matches:
if twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'tg_username_in_twitter_name',
'baseline_confidence': 0.88,
'match_details': {
'telegram_username': contact.username,
'twitter_name': twitter_profile['name'],
'found_in_display_name': True
}
})
seen_twitter_ids.add(twitter_profile['id'])
# Method 9: Twitter username in TG display name
if display_name:
matches = self.find_by_username_in_display_name(display_name, is_telegram=False, limit=3)
for twitter_profile in matches:
if twitter_profile['id'] not in seen_twitter_ids:
candidates.append({
'twitter_profile': twitter_profile,
'match_method': 'twitter_username_in_tg_name',
'baseline_confidence': 0.86,
'match_details': {
'telegram_name': display_name,
'twitter_username': twitter_profile['username'],
'username_in_display_name': True
}
})
seen_twitter_ids.add(twitter_profile['id'])
return candidates
def save_candidates_to_db(telegram_user_id: int, account_id: int, candidates: List[Dict], telegram_db):
"""Save candidates to the twitter_match_candidates table.
Note: despite the name, telegram_db must be an open psycopg2 cursor —
execute_values below requires a cursor, and the caller passes one.
"""
if not candidates:
return
insert_data = []
for cand in candidates:
tw = cand['twitter_profile']
insert_data.append((
account_id, # Add account_id
telegram_user_id,
tw['id'],
tw['username'],
tw['name'],
tw.get('description', ''),
tw.get('location'),
tw.get('verified', False),
tw.get('is_blue_verified', False),
tw.get('followers_count', 0),
cand.get('match_details', {}).get('candidate_rank', 1),
cand['match_method'],
cand['baseline_confidence'],
json.dumps(cand['match_details']) # Convert dict to JSON string
))
execute_values(telegram_db, """
INSERT INTO twitter_match_candidates (
account_id,
telegram_user_id,
twitter_id,
twitter_username,
twitter_name,
twitter_bio,
twitter_location,
twitter_verified,
twitter_blue_verified,
twitter_followers_count,
candidate_rank,
match_method,
baseline_confidence,
match_signals
) VALUES %s
ON CONFLICT DO NOTHING
""", insert_data, template="""(
%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
)""")
def main():
print()
print("=" * 70)
print("🔍 Twitter-Telegram Candidate Finder")
print("=" * 70)
print()
# Check arguments
test_mode = '--test' in sys.argv
limit = 100 if test_mode else None
# Check for --limit parameter
if '--limit' in sys.argv:
idx = sys.argv.index('--limit')
limit = int(sys.argv[idx + 1])
print(f"📊 LIMIT MODE: Processing first {limit:,} contacts")
print()
elif test_mode:
print("🧪 TEST MODE: Processing first 100 contacts only")
print()
# Connect to databases
print("📡 Connecting to databases...")
# Connect SQLAlchemy to localhost (not RDS)
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
localhost_engine = create_engine('postgresql://andrewjiang@localhost:5432/telegram_contacts')
LocalSession = sessionmaker(bind=localhost_engine)
telegram_db = LocalSession()
try:
twitter_conn = psycopg2.connect(**TWITTER_DB_CONFIG)
twitter_conn.autocommit = False
except Exception as e:
print(f"❌ Failed to connect to Twitter database: {e}")
print(f" Config: {TWITTER_DB_CONFIG}")
return False
# Also need psycopg2 connection to telegram DB for writing
try:
telegram_conn = psycopg2.connect(dbname='telegram_contacts', user='andrewjiang', host='localhost', port=5432)
telegram_conn.autocommit = False
except Exception as e:
print(f"❌ Failed to connect to Telegram database: {e}")
return False
print("✅ Connected to both databases")
print()
try:
# Initialize matcher (pass both connections for phash lookup)
telegram_psycopg_conn = psycopg2.connect(
dbname='telegram_contacts',
user='andrewjiang',
host='localhost'
)
matcher = TwitterMatcher(twitter_conn, telegram_psycopg_conn)
# Get Telegram contacts with bios or usernames
print("🔍 Loading Telegram contacts...")
query = telegram_db.query(Contact).filter(
Contact.user_id > 0, # Exclude channels
Contact.is_deleted == False,
Contact.is_bot == False
).filter(
(Contact.bio != None) | (Contact.username != None)
).order_by(Contact.user_id)
if limit:
query = query.limit(limit)
contacts = query.all()
print(f"✅ Found {len(contacts):,} contacts to process")
print()
# Process each contact
stats = {
'processed': 0,
'with_candidates': 0,
'total_candidates': 0,
'by_method': {}
}
print("🚀 Finding Twitter candidates...")
print()
for i, contact in enumerate(contacts, 1):
candidates = matcher.find_candidates_for_contact(contact)
if candidates:
stats['with_candidates'] += 1
stats['total_candidates'] += len(candidates)
# Track by method
for cand in candidates:
method = cand['match_method']
stats['by_method'][method] = stats['by_method'].get(method, 0) + 1
# Save to database
with telegram_conn.cursor() as cur:
save_candidates_to_db(contact.user_id, contact.account_id, candidates, cur)
telegram_conn.commit()
stats['processed'] += 1
# Progress update every 10 contacts
if i % 10 == 0:
print(f" Processed {i:,}/{len(contacts):,} contacts... (candidates: {stats['with_candidates']:,}, total: {stats['total_candidates']:,})")
elif i == 1:
# Show first contact immediately
print(f" Processed 1/{len(contacts):,} contacts... (candidates: {stats['with_candidates']:,}, total: {stats['total_candidates']:,})")
# Final stats
print()
print("=" * 70)
print("✅ CANDIDATE FINDING COMPLETE")
print("=" * 70)
print()
print(f"📊 Statistics:")
print(f" Processed: {stats['processed']:,} contacts")
print(f" With candidates: {stats['with_candidates']:,} ({stats['with_candidates']/max(stats['processed'], 1)*100:.1f}%)")
print(f" Total candidates: {stats['total_candidates']:,}")
print(f" Avg candidates per match: {stats['total_candidates']/max(stats['with_candidates'], 1):.1f}")
print()
print(f"📈 By method:")
for method, count in sorted(stats['by_method'].items(), key=lambda x: -x[1]):
print(f" {method}: {count:,}")
print()
return True
except Exception as e:
print(f"❌ Error: {e}")
import traceback
traceback.print_exc()
return False
finally:
telegram_db.close()
twitter_conn.close()
telegram_conn.close()
if __name__ == "__main__":
try:
success = main()
sys.exit(0 if success else 1)
except KeyboardInterrupt:
print("\n\n⚠️ Interrupted by user")
sys.exit(1)
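The phash tiers used by Method 1 above map Hamming distance between the two profile pictures' perceptual hashes onto a baseline confidence (0 bits → 0.95, 1 bit → 0.88, anything further apart is skipped). A minimal, self-contained sketch of that mapping — the helper names here are illustrative, not part of the module:

```python
def phash_distance(a_hex: str, b_hex: str) -> int:
    """Count differing bits between two equal-length hex phash digests."""
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")

def phash_confidence(distance: int):
    if distance == 0:
        return 0.95  # identical profile pictures - very strong signal
    if distance == 1:
        return 0.88  # near-identical (1-bit difference) - strong signal
    return None      # too dissimilar; the candidate is skipped

print(phash_distance("d1c4f0e2a1b3c4d5", "d1c4f0e2a1b3c4d4"))  # 1
```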

389
find_twitter_candidates_threaded.py Executable file
View File

@@ -0,0 +1,389 @@
#!/usr/bin/env python3
"""
Threading-based Twitter-Telegram Candidate Finder
Uses threading instead of multiprocessing for better macOS compatibility
and efficient I/O-bound parallel processing.
"""
import sys
from pathlib import Path
import psycopg2
from psycopg2.extras import execute_values, DictCursor
from concurrent.futures import ThreadPoolExecutor, as_completed
import argparse
from typing import List, Dict, Tuple
import time
import threading
# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from models import Contact
# Import the matcher from the original script
from find_twitter_candidates import TwitterMatcher, TWITTER_DB_CONFIG
# Database configuration
TELEGRAM_DB_URL = 'postgresql://andrewjiang@localhost:5432/telegram_contacts'
# Thread-local storage for database connections
thread_local = threading.local()
def get_thread_connections():
"""Get or create database connections for this thread"""
if not hasattr(thread_local, 'twitter_conn'):
thread_local.twitter_conn = psycopg2.connect(**TWITTER_DB_CONFIG)
thread_local.telegram_conn = psycopg2.connect(
dbname='telegram_contacts',
user='andrewjiang',
host='localhost',
port=5432
)
thread_local.matcher = TwitterMatcher(
thread_local.twitter_conn,
thread_local.telegram_conn
)
return thread_local.twitter_conn, thread_local.telegram_conn, thread_local.matcher
def process_contact(contact: Contact) -> Tuple[bool, int, List[Dict], Dict]:
"""
Process a single contact in a worker thread.
Returns: (success, num_candidates, candidates, method_stats)
"""
import io  # sys is already imported at module level
# Capture stderr to detect silent failures
old_stderr = sys.stderr
sys.stderr = stderr_capture = io.StringIO()
try:
# Get thread-local database connections
twitter_conn, telegram_conn, matcher = get_thread_connections()
# Find candidates
candidates = matcher.find_candidates_for_contact(contact)
# Check if any warnings were logged
stderr_content = stderr_capture.getvalue()
if stderr_content:
print(f" 🔍 Warnings for contact {contact.user_id}:")
print(f" {stderr_content}")
method_stats = {}
if candidates:
# Track method stats
for cand in candidates:
method = cand['match_method']
method_stats[method] = method_stats.get(method, 0) + 1
# Add contact info to each candidate for later insertion
for cand in candidates:
cand['telegram_user_id'] = contact.user_id
cand['account_id'] = contact.account_id
return True, len(candidates), candidates, method_stats
except Exception as e:
print(f" ⚠️ Error processing contact {contact.user_id}: {e}")
print(f" Contact details: username={contact.username}, display_name={getattr(contact, 'display_name', None)}")
import traceback
# sys.stderr is still redirected to the capture buffer here, so send the traceback to stdout
traceback.print_exc(file=sys.stdout)
return False, 0, [], {}
finally:
sys.stderr = old_stderr
def save_candidates_batch(candidates: List[Dict], telegram_conn):
"""Save a batch of candidates to database"""
if not candidates:
return 0
insert_data = []
for cand in candidates:
tw = cand['twitter_profile']
match_signals = cand.get('match_details', {})
insert_data.append((
cand['account_id'],
cand['telegram_user_id'],
tw['id'],
tw['username'],
tw.get('name'),
tw.get('description'),
tw.get('location'),
tw.get('verified', False),
tw.get('is_blue_verified', False),
tw.get('followers_count', 0),
0, # candidate_rank (will be set later if needed)
cand['match_method'],
cand['baseline_confidence'],
psycopg2.extras.Json(match_signals),
True, # needs_llm_review
False, # llm_processed
))
with telegram_conn.cursor() as cur:
execute_values(cur, """
INSERT INTO twitter_match_candidates (
account_id,
telegram_user_id,
twitter_id,
twitter_username,
twitter_name,
twitter_bio,
twitter_location,
twitter_verified,
twitter_blue_verified,
twitter_followers_count,
candidate_rank,
match_method,
baseline_confidence,
match_signals,
needs_llm_review,
llm_processed
) VALUES %s
ON CONFLICT (telegram_user_id, twitter_id) DO NOTHING
""", insert_data, page_size=1000)
telegram_conn.commit()
return len(insert_data)
def main():
parser = argparse.ArgumentParser(description='Find Twitter candidates for Telegram contacts (threading-based)')
parser.add_argument('--limit', type=int, help='Limit number of Telegram contacts to process')
parser.add_argument('--test', action='store_true', help='Test mode: process first 100 contacts only')
parser.add_argument('--workers', type=int, default=8,
help='Number of worker threads (default: 8)')
parser.add_argument('--user-id-min', type=int, help='Minimum user_id to process (for parallel ranges)')
parser.add_argument('--user-id-max', type=int, help='Maximum user_id to process (for parallel ranges)')
parser.add_argument('--range-name', type=str, help='Name for this range (for logging)')
args = parser.parse_args()
num_workers = args.workers
limit = args.limit
user_id_min = args.user_id_min
user_id_max = args.user_id_max
range_name = args.range_name or "full"
print("=" * 70)
print(f"🔍 Twitter-Telegram Candidate Finder (THREADED) - Range: {range_name}")
print("=" * 70)
print()
if args.test:
limit = 100
print("🧪 TEST MODE: Processing first 100 contacts only")
print()
elif limit:
print(f"📊 LIMIT MODE: Processing first {limit:,} contacts")
print()
if user_id_min is not None and user_id_max is not None:
print(f"📍 User ID Range: {user_id_min:,} to {user_id_max:,}")
print()
print(f"🧵 Worker threads: {num_workers}")
print()
# Load contacts using raw psycopg2
print("📡 Loading Telegram contacts...")
conn = psycopg2.connect(
dbname='telegram_contacts',
user='andrewjiang',
host='localhost',
port=5432
)
with conn.cursor() as cur:
# First, get list of already processed contacts
cur.execute("""
SELECT DISTINCT telegram_user_id
FROM twitter_match_candidates
""")
already_processed = set(row[0] for row in cur.fetchall())
print(f"📋 Already processed: {len(already_processed):,} contacts (will skip)")
print()
query = """
SELECT account_id, user_id, display_name, first_name, last_name, username, phone, bio, is_bot, is_deleted
FROM contacts
WHERE user_id > 0
AND is_deleted = false
AND is_bot = false
AND (bio IS NOT NULL OR username IS NOT NULL)
"""
# Add user_id range filter if specified
if user_id_min is not None:
query += f" AND user_id >= {user_id_min}"
if user_id_max is not None:
query += f" AND user_id <= {user_id_max}"
query += " ORDER BY user_id"
if limit:
query += f" LIMIT {limit}"
cur.execute(query)
rows = cur.fetchall()
conn.close()
# Convert to Contact objects, skipping already processed
contacts = []
skipped = 0
for row in rows:
user_id = row[1]
# Skip if already processed
if user_id in already_processed:
skipped += 1
continue
contact = Contact(
account_id=row[0],
user_id=user_id,
display_name=row[2],
first_name=row[3],
last_name=row[4],
username=row[5],
phone=row[6],
bio=row[7],
is_bot=row[8],
is_deleted=row[9]
)
contacts.append(contact)
total_contacts = len(contacts)
print(f"✅ Found {total_contacts:,} NEW contacts to process (skipped {skipped:,} already done)")
print()
print("🚀 Processing contacts with thread pool...")
print()
start_time = time.time()
# Stats tracking
total_processed = 0
total_with_candidates = 0
all_candidates = []
combined_method_stats = {}
total_saved = 0
# Database connection for incremental saves
telegram_conn = psycopg2.connect(
dbname='telegram_contacts',
user='andrewjiang',
host='localhost',
port=5432
)
# Process with ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=num_workers) as executor:
# Submit all tasks
future_to_contact = {
executor.submit(process_contact, contact): contact
for contact in contacts
}
# Process results as they complete
for i, future in enumerate(as_completed(future_to_contact), 1):
try:
success, num_candidates, candidates, method_stats = future.result()
if success:
total_processed += 1
if candidates:
total_with_candidates += 1
all_candidates.extend(candidates)
# Update method stats
for method, count in method_stats.items():
combined_method_stats[method] = combined_method_stats.get(method, 0) + count
# Flush to database whenever 100+ candidates have accumulated
if len(all_candidates) >= 100:
saved = save_candidates_batch(all_candidates, telegram_conn)
total_saved += saved
all_candidates = [] # Clear buffer
# Progress update every 10 contacts
if i % 10 == 0 or i == 1:
elapsed = time.time() - start_time
rate = total_processed / elapsed if elapsed > 0 else 0
remaining = total_contacts - total_processed
eta_seconds = remaining / rate if rate > 0 else 0
eta_hours = eta_seconds / 3600
print(f" Progress: {total_processed}/{total_contacts} ({total_processed/total_contacts*100:.1f}%) | "
f"{total_saved + len(all_candidates):,} candidates (💾 {total_saved:,} saved) | "
f"Rate: {rate:.1f}/sec | "
f"ETA: {eta_hours:.1f}h")
except Exception as e:
print(f" ❌ Error processing future: {e}")
import traceback
traceback.print_exc()
# Save any remaining candidates
if all_candidates:
saved = save_candidates_batch(all_candidates, telegram_conn)
total_saved += saved
telegram_conn.close()
elapsed = time.time() - start_time
print()
print(f"⏱️ Processing completed in {elapsed:.1f} seconds ({elapsed/60:.1f} minutes)")
print(f" Rate: {total_processed/elapsed:.1f} contacts/second")
print()
print(f"💾 Total saved: {total_saved:,} candidates")
print()
# Print stats
print("=" * 70)
print("✅ CANDIDATE FINDING COMPLETE")
print("=" * 70)
print()
print(f"📊 Statistics:")
print(f" Processed: {total_processed:,} contacts")
print(f" With candidates: {total_with_candidates:,} ({total_with_candidates/max(total_processed, 1)*100:.1f}%)")
print(f" Total candidates: {total_saved:,}")
if total_with_candidates > 0:
print(f" Avg candidates per match: {total_saved/total_with_candidates:.1f}")
print()
print(f"📈 By method:")
for method, count in sorted(combined_method_stats.items(), key=lambda x: x[1], reverse=True):
print(f" {method}: {count:,}")
print()
return True
if __name__ == "__main__":
try:
success = main()
sys.exit(0 if success else 1)
except KeyboardInterrupt:
print("\n\n❌ Interrupted by user")
sys.exit(1)
except Exception as e:
print(f"\n❌ Error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
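Because psycopg2 connections must not be shared across threads, the script above gives each worker its own lazily-created connection via `threading.local()`. A minimal sketch of that pattern, with sqlite3 standing in for psycopg2 so the example is self-contained (names are illustrative):

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

thread_local = threading.local()

def get_conn():
    # Each worker thread lazily opens (and then reuses) its own connection.
    if not hasattr(thread_local, "conn"):
        thread_local.conn = sqlite3.connect(":memory:")
    return thread_local.conn

def worker(n):
    cur = get_conn().execute("SELECT ?", (n,))
    return cur.fetchone()[0]

with ThreadPoolExecutor(max_workers=4) as ex:
    results = sorted(ex.map(worker, range(8)))
print(results)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The same idea applies directly to `get_thread_connections()` above: connection objects never cross thread boundaries, only the results do.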

225
main.py Executable file
View File

@@ -0,0 +1,225 @@
#!/usr/bin/env python3
"""
Twitter-Telegram Profile Matching System
Main menu for finding candidates and verifying matches with LLM
"""
import sys
import os
import subprocess
import psycopg2
# Add parent directory to path for imports
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'src'))
# Database configuration
DB_CONFIG = {
'dbname': 'telegram_contacts',
'user': 'andrewjiang',
'host': 'localhost',
'port': 5432
}
TWITTER_DB_CONFIG = {
'dbname': 'twitter_data',
'user': 'andrewjiang',
'host': 'localhost',
'port': 5432
}
def get_stats():
"""Get current matching statistics"""
conn = psycopg2.connect(**DB_CONFIG)
cur = conn.cursor()
stats = {}
# Candidates stats
cur.execute("""
SELECT
COUNT(DISTINCT telegram_user_id) as total_users,
COUNT(*) as total_candidates,
COUNT(*) FILTER (WHERE llm_processed = TRUE) as processed_candidates,
COUNT(*) FILTER (WHERE llm_processed = FALSE) as pending_candidates
FROM twitter_match_candidates
""")
row = cur.fetchone()
stats['total_users'] = row[0]
stats['total_candidates'] = row[1]
stats['processed_candidates'] = row[2]
stats['pending_candidates'] = row[3]
# Matches stats
cur.execute("""
SELECT
COUNT(*) as total_matches,
AVG(final_confidence) as avg_confidence,
COUNT(*) FILTER (WHERE final_confidence >= 0.90) as high_conf,
COUNT(*) FILTER (WHERE final_confidence >= 0.80 AND final_confidence < 0.90) as med_conf,
COUNT(*) FILTER (WHERE final_confidence >= 0.70 AND final_confidence < 0.80) as low_conf
FROM twitter_telegram_matches
""")
row = cur.fetchone()
stats['total_matches'] = row[0]
stats['avg_confidence'] = row[1] or 0
stats['high_conf'] = row[2]
stats['med_conf'] = row[3]
stats['low_conf'] = row[4]
# Users with matches
cur.execute("""
SELECT COUNT(DISTINCT telegram_user_id)
FROM twitter_telegram_matches
""")
stats['users_with_matches'] = cur.fetchone()[0]
cur.close()
conn.close()
return stats
def print_header():
"""Print main header"""
print()
print("=" * 80)
print("🔗 Twitter-Telegram Profile Matching System")
print("=" * 80)
print()
def print_stats():
"""Print current statistics"""
stats = get_stats()
print("📊 Current Statistics:")
print("-" * 80)
print(f"Candidates:")
print(f" • Users with candidates: {stats['total_users']:,}")
print(f" • Total candidates found: {stats['total_candidates']:,}")
print(f" • Processed by LLM: {stats['processed_candidates']:,}")
print(f" • Pending verification: {stats['pending_candidates']:,}")
print()
print(f"Verified Matches:")
print(f" • Users with matches: {stats['users_with_matches']:,}")
print(f" • Total matches: {stats['total_matches']:,}")
print(f" • Average confidence: {stats['avg_confidence']:.2f}")
print(f" • High confidence (90%+): {stats['high_conf']:,}")
print(f" • Medium confidence (80-89%): {stats['med_conf']:,}")
print(f" • Low confidence (70-79%): {stats['low_conf']:,}")
print("-" * 80)
print()
def run_script(script_name, *args):
"""Run a Python script with arguments"""
script_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), f"{script_name}.py")
cmd = [sys.executable, script_path] + list(args)  # use the current interpreter rather than a hardcoded python3.10
subprocess.run(cmd)
def main():
while True:
print_header()
print_stats()
print("📋 Main Menu:")
print()
print("STEP 1: Find Candidates")
print(" 1. Find Twitter candidates (threaded, RECOMMENDED)")
print(" 2. Find Twitter candidates (single-threaded)")
print()
print("STEP 2: Verify with LLM")
print(" 3. Verify matches with LLM (async, RECOMMENDED)")
print(" 4. Verify matches with LLM (test mode - 50 users)")
print()
print("Analysis & Review")
print(" 5. Review match quality")
print(" 6. Show statistics only")
print()
print(" 0. Exit")
print()
choice = input("👉 Enter your choice: ").strip()
if choice == '0':
print("\n👋 Goodbye!\n")
break
elif choice == '1':
# Find candidates (threaded)
print()
print("🔍 Finding Twitter candidates (threaded mode)...")
print()
limit_input = input("👉 How many contacts? (press Enter for all): ").strip()
workers = input("👉 Number of worker threads (default: 8): ").strip() or '8'
if limit_input:
run_script('find_twitter_candidates_threaded', '--limit', limit_input, '--workers', workers)
else:
run_script('find_twitter_candidates_threaded', '--workers', workers)
input("\n✅ Press Enter to continue...")
elif choice == '2':
# Find candidates (single-threaded)
print()
print("🔍 Finding Twitter candidates (single-threaded mode)...")
print()
limit_input = input("👉 How many contacts? (press Enter for all): ").strip()
if limit_input:
run_script('find_twitter_candidates', '--limit', limit_input)
else:
run_script('find_twitter_candidates')
input("\n✅ Press Enter to continue...")
elif choice == '3':
# Verify with LLM (async)
print()
print("🤖 Verifying matches with LLM (async mode)...")
print()
concurrent = input("👉 Concurrent requests (default: 100): ").strip() or '100'
run_script('verify_twitter_matches_v2', '--verbose', '--concurrent', concurrent)
input("\n✅ Press Enter to continue...")
elif choice == '4':
# Verify with LLM (test mode)
print()
print("🧪 Test mode: Verifying 50 users with LLM...")
print()
run_script('verify_twitter_matches_v2', '--test', '--limit', '50', '--verbose', '--concurrent', '10')
input("\n✅ Press Enter to continue...")
elif choice == '5':
# Review match quality
print()
print("📊 Reviewing match quality...")
print()
run_script('review_match_quality')
input("\n✅ Press Enter to continue...")
elif choice == '6':
# Just show stats, loop back to menu
continue
else:
print("\n❌ Invalid choice. Please try again.\n")
input("Press Enter to continue...")
if __name__ == "__main__":
try:
main()
except KeyboardInterrupt:
print("\n\n👋 Interrupted. Goodbye!\n")
sys.exit(0)
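The `get_stats()` query above relies on PostgreSQL's `COUNT(*) FILTER (WHERE ...)` aggregate clause to compute several conditional counts in a single pass. A portable sketch of the same idea using the `SUM(CASE ...)` form, which runs on sqlite3 (table contents here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE twitter_match_candidates (llm_processed INTEGER)")
conn.executemany("INSERT INTO twitter_match_candidates VALUES (?)",
                 [(1,), (1,), (0,)])

# total, processed, and pending candidates computed in one scan
row = conn.execute("""
    SELECT COUNT(*),
           SUM(CASE WHEN llm_processed = 1 THEN 1 ELSE 0 END),
           SUM(CASE WHEN llm_processed = 0 THEN 1 ELSE 0 END)
    FROM twitter_match_candidates
""").fetchone()
print(row)  # (3, 2, 1)
```

On PostgreSQL the `FILTER` form used in `get_stats()` is equivalent and usually clearer.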

406
review_match_quality.py Executable file
View File

@@ -0,0 +1,406 @@
#!/usr/bin/env python3
"""
Critical Match Quality Reviewer
Analyzes verification results with deep understanding of:
- Twitter/Telegram/crypto culture
- Common false positive patterns
- Company vs personal account indicators
- Context alignment
"""
import re
import json
from pathlib import Path
from typing import List, Dict, Tuple
LOG_FILE = Path(__file__).parent.parent / 'verification_v2_log.txt'
class MatchReviewer:
"""Critical reviewer with crypto/web3 domain knowledge"""
# Company/product/team indicators
COMPANY_INDICATORS = [
r'\bis\s+(a|an)\s+\w+\s+(way|tool|platform|app|service|protocol)',
r'official\s+(account|page|channel)',
r'(team|official)\s*$',
r'^(the|a)\s+\w+\s+(for|to)',
r'brought to you by',
r'hosted by',
r'founded by',
r'(community|project)\s+(account|page)',
r'(dao|protocol|network)\s*$',
r'building\s+(the|a)\s+',
]
# Personal account indicators
PERSONAL_INDICATORS = [
r'(ceo|founder|co-founder|developer|builder|engineer|researcher)\s+at',
r'working\s+(at|on|with)',
r'^(i|i\'m|my)',
r'(my|personal)\s+(views|opinions|thoughts)',
r'\b(he/him|she/her|they/them)\b',
]
# Crypto/Web3 keywords
CRYPTO_KEYWORDS = [
'web3', 'crypto', 'blockchain', 'defi', 'nft', 'dao', 'dapp',
'ethereum', 'solana', 'bitcoin', 'polygon', 'base', 'arbitrum',
'smart contract', 'token', 'wallet', 'metamask', 'coinbase',
'yield', 'farming', 'staking', 'airdrop', 'whitelist', 'mint',
'protocol', 'l1', 'l2', 'rollup', 'zk', 'evm'
]
def __init__(self):
self.issues = []
self.stats = {
'total_matches': 0,
'false_positives': 0,
'questionable': 0,
'good_matches': 0,
'company_accounts': 0,
'weak_evidence': 0,
'context_mismatch': 0,
}
def parse_log(self) -> List[Dict]:
"""Parse the verification log into structured data"""
with open(LOG_FILE, 'r') as f:
content = f.read()
entries = []
# Find all sections that start with TELEGRAM USER
pattern = r'TELEGRAM USER: ([^\(]+) \(ID: (\d+)\)(.*?)(?=TELEGRAM USER:|$)'
matches = re.finditer(pattern, content, re.DOTALL)
for match in matches:
tg_username = match.group(1).strip()
tg_id = int(match.group(2))
section = match.group(3)
# Extract TG profile
tg_bio_match = re.search(r'Bio: (.*?)(?:\n+TWITTER CANDIDATES)', section, re.DOTALL)
tg_bio = tg_bio_match.group(1).strip() if tg_bio_match else ''
# Extract username specificity
spec_match = re.search(r'username has specificity score ([\d.]+)', section)
specificity = float(spec_match.group(1)) if spec_match else 0.5
# Extract candidates
candidates = []
candidate_blocks = re.findall(
r'\[Candidate (\d+)\](.*?)(?=\[Candidate \d+\]|LLM RESPONSE:)',
section,
re.DOTALL
)
for idx, block in candidate_blocks:
tw_username = re.search(r'Twitter Username: @(\S+)', block)
tw_name = re.search(r'Twitter Display Name: (.+)', block)
tw_bio = re.search(r'Twitter Bio: (.*?)(?=\nLocation:)', block, re.DOTALL)
tw_followers = re.search(r'Followers: ([\d,]+)', block)
match_method = re.search(r'Match Method: (\S+)', block)
baseline_conf = re.search(r'Baseline Confidence: ([\d.]+)', block)
candidates.append({
'index': int(idx),
'twitter_username': tw_username.group(1) if tw_username else '',
'twitter_name': tw_name.group(1).strip() if tw_name else '',
'twitter_bio': tw_bio.group(1).strip() if tw_bio else '',
'twitter_followers': tw_followers.group(1) if tw_followers else '0',
'match_method': match_method.group(1) if match_method else '',
'baseline_confidence': float(baseline_conf.group(1)) if baseline_conf else 0.0,
})
# Extract LLM response (handle multiline JSON with nested structures)
llm_match = re.search(r'LLM RESPONSE:\s*-+\s*(\{.*)', section, re.DOTALL)
if llm_match:
try:
# Extract JSON - it should be everything after "LLM RESPONSE:" until end of section
json_text = llm_match.group(1)
# Find the JSON object (balanced braces)
brace_count = 0
json_end = 0
for i, char in enumerate(json_text):
if char == '{':
brace_count += 1
elif char == '}':
brace_count -= 1
if brace_count == 0:
json_end = i + 1
break
if json_end > 0:
json_str = json_text[:json_end]
llm_result = json.loads(json_str)
else:
llm_result = {'candidates': []}
except Exception:
# Malformed or partial JSON - fall back to an empty result
llm_result = {'candidates': []}
else:
llm_result = {'candidates': []}
entries.append({
'telegram_username': tg_username,
'telegram_id': tg_id,
'telegram_bio': tg_bio,
'username_specificity': specificity,
'candidates': candidates,
'llm_results': llm_result.get('candidates', [])
})
return entries
def is_company_account(self, bio: str, name: str) -> Tuple[bool, str]:
"""Detect if this is a company/product/team account"""
text = (bio + ' ' + name).lower()
for pattern in self.COMPANY_INDICATORS:
if re.search(pattern, text, re.IGNORECASE):
return True, f"Company pattern: '{pattern}'"
# Check if name equals bio description
if bio and len(bio.split()) < 20:
# Short bio describing what something "is"
if re.search(r'\bis\s+(a|an|the)\s+', bio, re.IGNORECASE):
return True, "Bio describes a product/service"
return False, ""
def is_personal_account(self, bio: str) -> bool:
"""Detect personal account indicators"""
for pattern in self.PERSONAL_INDICATORS:
if re.search(pattern, bio, re.IGNORECASE):
return True
return False
def has_crypto_context(self, bio: str) -> Tuple[bool, List[str]]:
"""Check if bio has crypto/web3 context"""
if not bio:
return False, []
bio_lower = bio.lower()
found_keywords = []
for keyword in self.CRYPTO_KEYWORDS:
if keyword in bio_lower:
found_keywords.append(keyword)
return len(found_keywords) > 0, found_keywords
def review_match(self, entry: Dict) -> Dict:
"""Critically review a single match"""
issues = []
severity = 'GOOD'
tg_username = entry['telegram_username']
tg_bio = entry['telegram_bio']
tg_has_crypto, tg_crypto_keywords = self.has_crypto_context(tg_bio)
# Review each LLM-approved match (confidence >= 0.5)
for llm_result in entry['llm_results']:
confidence = llm_result.get('confidence', 0)
if confidence < 0.5:
continue
self.stats['total_matches'] += 1
candidate_idx = llm_result.get('candidate_index', 0) - 1
if candidate_idx < 0 or candidate_idx >= len(entry['candidates']):
continue
candidate = entry['candidates'][candidate_idx]
tw_username = candidate['twitter_username']
tw_bio = candidate['twitter_bio']
tw_name = candidate['twitter_name']
match_method = candidate['match_method']
# Check 1: Company account
is_company, company_reason = self.is_company_account(tw_bio, tw_name)
if is_company:
issues.append({
'type': 'COMPANY_ACCOUNT',
'severity': 'HIGH',
'description': f"Twitter @{tw_username} appears to be a company/product account",
'evidence': company_reason,
'confidence': confidence
})
self.stats['company_accounts'] += 1
severity = 'FALSE_POSITIVE'
# Check 2: Context mismatch
tw_has_crypto, tw_crypto_keywords = self.has_crypto_context(tw_bio)
if tg_has_crypto and not tw_has_crypto:
issues.append({
'type': 'CONTEXT_MISMATCH',
'severity': 'MEDIUM',
'description': "TG bio has crypto context but TW bio doesn't",
'evidence': f"TG keywords: {tg_crypto_keywords}, TW keywords: none",
'confidence': confidence
})
self.stats['context_mismatch'] += 1
if severity == 'GOOD':
severity = 'QUESTIONABLE'
# Check 3: Empty bio with no strong evidence
if not tg_bio and not tw_bio and confidence > 0.8:
issues.append({
'type': 'WEAK_EVIDENCE',
'severity': 'MEDIUM',
'description': f"High confidence ({confidence}) with both bios empty",
'evidence': f"Only username match, no contextual verification",
'confidence': confidence
})
self.stats['weak_evidence'] += 1
if severity == 'GOOD':
severity = 'QUESTIONABLE'
# Check 4: Generic username with high confidence
if entry['username_specificity'] < 0.6 and confidence > 0.85:
issues.append({
'type': 'GENERIC_USERNAME',
'severity': 'LOW',
'description': f"Generic username ({entry['username_specificity']:.2f} specificity) with high confidence",
'evidence': f"Username: {tg_username}",
'confidence': confidence
})
if severity == 'GOOD':
severity = 'QUESTIONABLE'
# Check 5: Twitter bio mentions other accounts
if match_method == 'twitter_bio_has_telegram':
# Check if the telegram username appears as @mention (not the account itself)
mentions = re.findall(r'@(\w+)', tw_bio)
if tg_username.lower() not in [m.lower() for m in mentions]:
# The username is embedded in another handle
issues.append({
'type': 'SUBSTRING_MATCH',
'severity': 'HIGH',
'description': f"TG username found as substring in other accounts, not direct mention",
'evidence': f"TW bio: {tw_bio[:100]}",
'confidence': confidence
})
severity = 'FALSE_POSITIVE'  # counted once in the severity tally below
# Count severity
if severity == 'FALSE_POSITIVE':
self.stats['false_positives'] += 1
elif severity == 'QUESTIONABLE':
self.stats['questionable'] += 1
else:
self.stats['good_matches'] += 1
return {
'telegram_username': tg_username,
'telegram_id': entry['telegram_id'],
'severity': severity,
'issues': issues,
'entry': entry
}
def generate_report(self, reviews: List[Dict]):
"""Generate comprehensive review report"""
print()
print("=" * 100)
print("🔍 MATCH QUALITY REVIEW REPORT")
print("=" * 100)
print()
print("📊 STATISTICS:")
print(f" Total matches reviewed: {self.stats['total_matches']}")
print(f" ✅ Good matches: {self.stats['good_matches']} ({self.stats['good_matches']/max(self.stats['total_matches'],1)*100:.1f}%)")
print(f" ⚠️ Questionable: {self.stats['questionable']} ({self.stats['questionable']/max(self.stats['total_matches'],1)*100:.1f}%)")
print(f" ❌ False positives: {self.stats['false_positives']} ({self.stats['false_positives']/max(self.stats['total_matches'],1)*100:.1f}%)")
print()
print("🚨 ISSUE BREAKDOWN:")
print(f" Company accounts: {self.stats['company_accounts']}")
print(f" Context mismatches: {self.stats['context_mismatch']}")
print(f" Weak evidence: {self.stats['weak_evidence']}")
print()
# Show false positives
false_positives = [r for r in reviews if r['severity'] == 'FALSE_POSITIVE']
if false_positives:
print("=" * 100)
print("❌ FALSE POSITIVES:")
print("=" * 100)
for review in false_positives[:10]: # Show top 10
print()
print(f"TG @{review['telegram_username']} (ID: {review['telegram_id']})")
print(f"TG Bio: {review['entry']['telegram_bio'][:100]}")
for issue in review['issues']:
print(f" ❌ [{issue['severity']}] {issue['type']}: {issue['description']}")
print(f" Evidence: {issue['evidence'][:150]}")
print(f" LLM Confidence: {issue['confidence']:.2f}")
# Show questionable matches
questionable = [r for r in reviews if r['severity'] == 'QUESTIONABLE']
if questionable:
print()
print("=" * 100)
print("⚠️ QUESTIONABLE MATCHES:")
print("=" * 100)
for review in questionable[:10]: # Show top 10
print()
print(f"TG @{review['telegram_username']} (ID: {review['telegram_id']})")
for issue in review['issues']:
print(f" ⚠️ [{issue['severity']}] {issue['type']}: {issue['description']}")
print(f" Evidence: {issue['evidence'][:150]}")
print(f" LLM Confidence: {issue['confidence']:.2f}")
print()
print("=" * 100)
print("💡 RECOMMENDATIONS:")
print("=" * 100)
print()
if self.stats['company_accounts'] > 0:
print("1. Add company account detection to prompt:")
print(" - Check for product descriptions ('X is a platform for...')")
print(" - Look for 'official', 'team', 'hosted by' patterns")
print(" - Distinguish personal vs organizational accounts")
print()
if self.stats['context_mismatch'] > 0:
print("2. Strengthen context matching:")
print(" - Require crypto/web3 keywords in both profiles")
print(" - Lower confidence when contexts don't align")
print()
if self.stats['weak_evidence'] > 0:
print("3. Adjust confidence for weak evidence:")
print(" - Cap confidence at 0.70 when both bios are empty")
print(" - Require additional signals beyond username match")
print()
print("4. Fix 'twitter_bio_has_telegram' method:")
print(" - Only match direct @mentions, not substrings in other handles")
print(" - Example: @hipster should NOT match mentions of @HipsterHacker")
print()
def main():
reviewer = MatchReviewer()
print("📖 Parsing verification log...")
entries = reviewer.parse_log()
print(f"✅ Parsed {len(entries)} verification entries")
print()
print("🔍 Reviewing match quality...")
reviews = []
for entry in entries:
if entry['llm_results']: # Only review entries with matches
review = reviewer.review_match(entry)
reviews.append(review)
print(f"✅ Reviewed {len(reviews)} matches")
reviewer.generate_report(reviews)
if __name__ == "__main__":
main()
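The balanced-brace scan in parse_log (counting `{`/`}` until the depth returns to zero) can be lifted into a standalone helper. A minimal sketch; `extract_balanced_json` is a hypothetical name, and like the original scan it does not account for braces inside JSON string values, which the verification logs here do not contain:

```python
import json

def extract_balanced_json(text):
    """Parse the first balanced {...} object in text, or return None.

    Walks the string, incrementing on '{' and decrementing on '}';
    the object ends where the depth first returns to zero.
    """
    start = text.find('{')
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None
```

Applied to a log section, this recovers the `candidates` object even when trailing text follows the closing brace.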


@@ -0,0 +1,101 @@
-- Twitter-Telegram Matching Schema
-- Run this against your telegram_contacts database
-- Table: twitter_telegram_matches
-- Stores confirmed matches between Telegram and Twitter profiles
CREATE TABLE IF NOT EXISTS twitter_telegram_matches (
id SERIAL PRIMARY KEY,
-- Telegram side
telegram_user_id BIGINT NOT NULL REFERENCES contacts(user_id),
telegram_username VARCHAR(255),
telegram_name VARCHAR(255),
telegram_bio TEXT,
-- Twitter side
twitter_id VARCHAR(50) NOT NULL,
twitter_username VARCHAR(100) NOT NULL,
twitter_name VARCHAR(200),
twitter_bio TEXT,
twitter_location VARCHAR(200),
twitter_verified BOOLEAN,
twitter_blue_verified BOOLEAN,
twitter_followers_count INTEGER,
-- Matching metadata
match_method VARCHAR(50) NOT NULL, -- e.g. 'exact_bio_handle', 'exact_username', 'phash_match', 'fuzzy_name'
baseline_confidence FLOAT NOT NULL, -- Confidence before LLM (0-1)
llm_verdict VARCHAR(20) NOT NULL, -- 'CONFIDENT', 'MODERATE', 'UNSURE'
final_confidence FLOAT NOT NULL CHECK (final_confidence BETWEEN 0 AND 1),
-- Match details (JSON for flexibility)
match_details JSONB, -- {extracted_handles: [...], username_variation: 'xxx', fuzzy_score: 0.85}
-- LLM metadata
llm_tokens_used INTEGER,
llm_cost FLOAT,
-- Audit trail
matched_at TIMESTAMP DEFAULT NOW(),
needs_manual_review BOOLEAN DEFAULT FALSE,
verified_manually BOOLEAN DEFAULT FALSE,
manual_review_notes TEXT,
UNIQUE(telegram_user_id, twitter_id)
);
-- Indexes for performance
CREATE INDEX IF NOT EXISTS idx_ttm_telegram_user ON twitter_telegram_matches(telegram_user_id);
CREATE INDEX IF NOT EXISTS idx_ttm_twitter_id ON twitter_telegram_matches(twitter_id);
CREATE INDEX IF NOT EXISTS idx_ttm_twitter_username ON twitter_telegram_matches(LOWER(twitter_username));
CREATE INDEX IF NOT EXISTS idx_ttm_confidence ON twitter_telegram_matches(final_confidence DESC);
CREATE INDEX IF NOT EXISTS idx_ttm_verdict ON twitter_telegram_matches(llm_verdict);
CREATE INDEX IF NOT EXISTS idx_ttm_method ON twitter_telegram_matches(match_method);
CREATE INDEX IF NOT EXISTS idx_ttm_needs_review ON twitter_telegram_matches(needs_manual_review) WHERE needs_manual_review = TRUE;
-- Table: twitter_match_candidates (temporary staging)
-- Stores potential matches before LLM verification
CREATE TABLE IF NOT EXISTS twitter_match_candidates (
id SERIAL PRIMARY KEY,
telegram_user_id BIGINT NOT NULL REFERENCES contacts(user_id),
-- Twitter candidate info
twitter_id VARCHAR(50) NOT NULL,
twitter_username VARCHAR(100) NOT NULL,
twitter_name VARCHAR(200),
twitter_bio TEXT,
twitter_location VARCHAR(200),
twitter_verified BOOLEAN,
twitter_blue_verified BOOLEAN,
twitter_followers_count INTEGER,
-- Candidate scoring
candidate_rank INTEGER, -- 1 = best match, 2 = second best, etc.
match_method VARCHAR(50),
baseline_confidence FLOAT,
match_signals JSONB, -- {'handle_match': true, 'fuzzy_score': 0.85, ...}
-- LLM processing status
needs_llm_review BOOLEAN DEFAULT TRUE,
llm_processed BOOLEAN DEFAULT FALSE,
llm_verdict VARCHAR(20),
final_confidence FLOAT,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_tmc_telegram_user ON twitter_match_candidates(telegram_user_id);
CREATE INDEX IF NOT EXISTS idx_tmc_twitter_id ON twitter_match_candidates(twitter_id);
CREATE INDEX IF NOT EXISTS idx_tmc_needs_review ON twitter_match_candidates(llm_processed, needs_llm_review)
WHERE needs_llm_review = TRUE AND llm_processed = FALSE;
CREATE INDEX IF NOT EXISTS idx_tmc_rank ON twitter_match_candidates(telegram_user_id, candidate_rank);
-- Grant permissions (adjust as needed for your setup)
-- GRANT ALL PRIVILEGES ON twitter_telegram_matches TO your_user;
-- GRANT ALL PRIVILEGES ON twitter_match_candidates TO your_user;
-- GRANT USAGE, SELECT ON SEQUENCE twitter_telegram_matches_id_seq TO your_user;
-- GRANT USAGE, SELECT ON SEQUENCE twitter_match_candidates_id_seq TO your_user;
COMMENT ON TABLE twitter_telegram_matches IS 'Confirmed matches between Telegram and Twitter profiles';
COMMENT ON TABLE twitter_match_candidates IS 'Temporary staging for potential matches awaiting LLM verification';
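The `llm_verdict` and `needs_manual_review` columns are derived from the final confidence score. A small sketch of that mapping, mirroring the thresholds used when rows are written in verify_twitter_matches_v2.py (>= 0.8 CONFIDENT, >= 0.6 MODERATE, below that UNSURE; anything under 0.75 is flagged for review); `verdict_for` is a hypothetical helper name:

```python
def verdict_for(confidence):
    """Map a final confidence score to (llm_verdict, needs_manual_review)."""
    if confidence >= 0.8:
        verdict = 'CONFIDENT'
    elif confidence >= 0.6:
        verdict = 'MODERATE'
    else:
        verdict = 'UNSURE'
    # Matches below 0.75 are queued for manual review
    return verdict, confidence < 0.75
```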

verify_twitter_matches_v2.py Executable file

@@ -0,0 +1,791 @@
#!/usr/bin/env python3
"""
Twitter-Telegram Match Verifier V2 (Confidence-Based with Batch Evaluation)
Uses LLM with confidence scoring (0-1) and evaluates all candidates for a TG user together
"""
import sys
import asyncio
import json
from pathlib import Path
from typing import List, Dict
import psycopg2
from psycopg2.extras import DictCursor, RealDictCursor
import openai
from datetime import datetime
import re
# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))
from db_config import SessionLocal
from models import Contact
# Twitter database connection
TWITTER_DB_CONFIG = {
'dbname': 'twitter_data',
'user': 'andrewjiang',
'host': 'localhost',
'port': 5432
}
# Checkpoint file
CHECKPOINT_FILE = Path(__file__).parent.parent / 'llm_verification_v2_checkpoint.json'
class CheckpointManager:
"""Manage checkpoints for resumable verification"""
def __init__(self, checkpoint_file):
self.checkpoint_file = checkpoint_file
self.data = self.load()
def load(self):
if self.checkpoint_file.exists():
with open(self.checkpoint_file, 'r') as f:
return json.load(f)
return {
'last_processed_telegram_id': None,
'processed_count': 0,
'total_matches_saved': 0,
'total_cost': 0.0,
'total_tokens': 0,
'started_at': None
}
def save(self):
self.data['last_updated_at'] = datetime.now().isoformat()
with open(self.checkpoint_file, 'w') as f:
json.dump(self.data, f, indent=2)
def update(self, telegram_user_id, matches_saved, tokens_used, cost):
self.data['last_processed_telegram_id'] = telegram_user_id
self.data['processed_count'] += 1
self.data['total_matches_saved'] += matches_saved
self.data['total_tokens'] += tokens_used
self.data['total_cost'] += cost
# Save every 20 telegram users
if self.data['processed_count'] % 20 == 0:
self.save()
def calculate_username_specificity(username: str) -> float:
"""
Calculate how specific/unique a username is (0-1)
Generic usernames get lower scores, unique ones get higher scores
"""
if not username:
return 0.5
username_lower = username.lower()
# Very generic patterns
generic_patterns = [
r'^admin\d*$', r'^user\d+$', r'^crypto\d*$', r'^nft\d*$',
r'^web3\d*$', r'^trader\d*$', r'^dev\d*$', r'^official\d*$',
r'^team\d*$', r'^support\d*$', r'^info\d*$'
]
for pattern in generic_patterns:
if re.match(pattern, username_lower):
return 0.3
# Length-based scoring
length = len(username)
if length < 4:
return 0.4
elif length < 6:
return 0.6
elif length < 8:
return 0.75
else:
return 0.9
class LLMVerifier:
"""Use GPT-5 Nano for confidence-based verification"""
def __init__(self):
self.client = openai.AsyncOpenAI()
self.model = "gpt-5-mini" # GPT-5 Mini - balanced performance and cost
# Cost calculation for gpt-5-mini
self.input_cost_per_1m = 0.25 # $0.25 per 1M input tokens
self.output_cost_per_1m = 2.00 # $2.00 per 1M output tokens
def build_batch_prompt(self, telegram_profile: Dict, candidates: List[Dict]) -> str:
"""Build LLM prompt for batch evaluation of all candidates"""
# Format Telegram profile
tg_bio = telegram_profile.get('bio') or 'none'  # key may exist with a None value
if tg_bio and len(tg_bio) > 300:
tg_bio = tg_bio[:300] + '...'
# Format chat context
chat_context = telegram_profile.get('chat_context', {})
chat_info = ""
if chat_context.get('chat_titles'):
chat_info = f"\nGroup Chats: {chat_context['chat_titles']}"
if chat_context.get('is_crypto_focused'):
chat_info += " [CRYPTO-FOCUSED GROUPS]"
prompt = f"""TELEGRAM PROFILE:
Username: @{telegram_profile.get('username') or 'none'}
Display Name: {telegram_profile.get('name', 'none')} (first_name + last_name combined)
Bio: {tg_bio}{chat_info}
TWITTER CANDIDATES (evaluate all together):
"""
for i, candidate in enumerate(candidates, 1):
tw_bio = candidate.get('twitter_bio') or 'none'
if tw_bio and len(tw_bio) > 250:
tw_bio = tw_bio[:250] + '...'
# Check if phash match info exists
phash_info = ""
if candidate['match_method'] == 'phash_match':
phash_distance = candidate.get('match_signals', {}).get('phash_distance', 'unknown')
phash_info = f"\n⭐ Phash Match: distance={phash_distance} (identical profile pictures!)"
prompt += f"""
[Candidate {i}]
Twitter Username: @{candidate['twitter_username']}
Twitter Display Name: {candidate.get('twitter_name', 'Unknown')}
Twitter Bio: {tw_bio}
Location: {candidate.get('twitter_location') or 'none'}
Verified: {candidate.get('twitter_verified', False)} (Blue: {candidate.get('twitter_blue_verified', False)})
Followers: {candidate.get('twitter_followers_count', 0):,}
Match Method: {candidate['match_method']}{phash_info}
Baseline Confidence: {candidate.get('baseline_confidence', 0):.2f}
"""
return prompt
async def verify_batch(self, telegram_profile: Dict, candidates: List[Dict], semaphore, log_file=None) -> Dict:
"""Verify all candidates for a single Telegram user"""
async with semaphore:
prompt = self.build_batch_prompt(telegram_profile, candidates)
if log_file:
log_file.write(f"\n{'=' * 100}\n")
log_file.write(f"TELEGRAM USER: {telegram_profile.get('username', 'N/A')} (ID: {telegram_profile['user_id']})\n")
log_file.write(f"{'=' * 100}\n\n")
system_prompt = """You are an expert at determining if two social media profiles belong to the same person.
# TASK
Determine confidence (0.0-1.0) that each Twitter candidate is the same person as the Telegram profile.
**CRITICAL: Evaluate ALL candidates together, not in isolation. Compare them against each other to identify which has the STRONGEST evidence.**
# SIGNAL STRENGTH GUIDE
Evaluate the FULL CONTEXT of all available signals together. Individual signals can be strong or weak, but the overall picture matters most.
## VERY STRONG SIGNALS (Can individually suggest high confidence)
- **Explicit bio mention**: TG bio says "x.com/username" or "Follow me @username"
- ⚠️ EXCEPTION: If that account is clearly a company/project (not personal), this is NOT definitive
- Example: TG bio "x.com/gems_gun" but @gems_gun is company account → Look for personal account like @lucahl0 "Building @gems_gun"
- **Unique username exact match**: Unusual/long username (like @kupermind, @schellinger_k) that matches exactly
- Generic usernames (@mike, @crypto123) don't qualify as "unique"
## STRONG SUPPORTING SIGNALS (Good indicators when combined)
Each of these helps build confidence, especially when multiple align:
- Full name match (after normalization: remove .eth, emojis, separators)
- Same profile picture (phash match)
- Aligned bio themes/context (both in crypto, both mention same projects/interests)
- Very similar username (not exact, but close: @kevin vs @k_kevin)
## WEAK SIGNALS (Need multiple strong signals to be meaningful)
- Generic name match only (Alex, Mike, David, John, Baz)
- Same general field but no specifics
- Partial username similarity with generic name
## RED FLAGS (Lower confidence significantly)
- Context mismatch: TG is crypto/tech, TW is chef/athlete/journalist
- Company account when looking for personal profile
- Famous person/celebrity (unless clear evidence it's actually them)
# CONFIDENCE BANDS
## 0.90-1.0: NEARLY CERTAIN
Very strong signal (bio mention of personal account OR unique username match) + supporting signals align
Examples:
- TG bio: "https://x.com/kupermind" + TW @kupermind personal account → 0.97
- TG @olliten + TW @olliten + same name + same pic → 0.98
## 0.70-0.89: LIKELY
Multiple strong supporting signals converge, or one very strong signal with some gap
Examples:
- Very similar username + name match + context: @schellinger → @k_schellinger "Kevin Schellinger" → 0.85
- Exact username on moderately unique name: @alexcrypto + "Alex Smith" crypto → 0.78
## 0.40-0.69: POSSIBLE
Some evidence but significant uncertainty
- Generic name + same field but no username/pic match
- Weak username similarity with generic name
- Profile pic match but name is very generic
## 0.10-0.39: UNLIKELY
Minimal evidence or contradictions
- Only generic name match (David, Alex, Mike)
- Context mismatch (crypto person vs chef)
## 0.0-0.09: EXTREMELY UNLIKELY
No meaningful evidence or clear contradiction
# COMPARATIVE EVALUATION PROCESS
**Step 1: Review ALL candidates together**
Don't score each in isolation. Look at the full set to understand which has the strongest evidence.
**Step 2: Identify the strongest signals present**
- Is there a bio mention? (Check if it's personal vs company account!)
- Is there a unique username match?
- Do multiple supporting signals converge for one candidate?
**Step 3: Apply differential scoring**
- The candidate with STRONGEST evidence should get meaningfully higher score
- If Candidate A has unique username + name match, and Candidate B only has generic name → A gets 0.85+, B gets 0.40 max
- If ALL candidates only have weak signals (generic name only) → ALL score 0.20-0.40
**Step 4: Sanity checks**
- Could this evidence match thousands of people? → Lower confidence
- Is there a context mismatch? → Max 0.50
- Is this a company account when we need personal? → Not the right match
**Key principle: Only ONE candidate can be "most likely" - differentiate clearly between them.**
# TECHNICAL NOTES
**Name Normalization**: Before comparing, remove .eth/.ton/.sol suffixes, emojis, "| company" separators, and ignore capitalization
**Profile Picture (phash)**: Phash match alone → MAX 0.70 (supporting signal). Use to break ties or add confidence to other signals.
# OUTPUT FORMAT
Return ONLY valid JSON (no markdown, no explanation):
{{
"candidates": [
{{
"candidate_index": 1,
"confidence": 0.85,
"reasoning": "Brief explanation"
}},
...
]
}}"""
try:
if log_file:
log_file.write("SYSTEM PROMPT:\n")
log_file.write("-" * 100 + "\n")
log_file.write(system_prompt + "\n\n")
log_file.write("USER PROMPT:\n")
log_file.write("-" * 100 + "\n")
log_file.write(prompt + "\n\n")
log_file.flush()
response = await self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
],
response_format={"type": "json_object"}
)
content = response.choices[0].message.content.strip()
if log_file:
log_file.write("LLM RESPONSE:\n")
log_file.write("-" * 100 + "\n")
log_file.write(content + "\n\n")
log_file.flush()
# Parse JSON
try:
result = json.loads(content)
except json.JSONDecodeError:
print(f" ⚠️ Failed to parse JSON response")
return {
'success': False,
'error': 'json_parse_error',
'tokens_used': response.usage.total_tokens,
'cost': self.calculate_cost(response.usage)
}
tokens_used = response.usage.total_tokens
cost = self.calculate_cost(response.usage)
return {
'success': True,
'results': result.get('candidates', []),
'tokens_used': tokens_used,
'cost': cost,
'error': None
}
except Exception as e:
print(f" ⚠️ LLM error: {str(e)[:100]}")
return {
'success': False,
'error': str(e),
'tokens_used': 0,
'cost': 0.0
}
def calculate_cost(self, usage) -> float:
"""Calculate cost for this API call"""
input_cost = (usage.prompt_tokens / 1_000_000) * self.input_cost_per_1m
output_cost = (usage.completion_tokens / 1_000_000) * self.output_cost_per_1m
return input_cost + output_cost
def get_telegram_users_with_candidates(telegram_conn, checkpoint_manager, limit=None):
"""Get list of telegram_user_ids that have unprocessed candidates"""
with telegram_conn.cursor() as cur:
query = """
SELECT DISTINCT telegram_user_id
FROM twitter_match_candidates
WHERE needs_llm_review = TRUE
AND llm_processed = FALSE
"""
if checkpoint_manager.data['last_processed_telegram_id']:
query += f" AND telegram_user_id > {int(checkpoint_manager.data['last_processed_telegram_id'])}"
query += " ORDER BY telegram_user_id"
if limit:
query += f" LIMIT {limit}"
cur.execute(query)
return [row[0] for row in cur.fetchall()]
def get_candidates_for_telegram_user(telegram_user_id: int, telegram_conn):
"""Get all candidates for a specific Telegram user"""
with telegram_conn.cursor(cursor_factory=RealDictCursor) as cur:
cur.execute("""
SELECT *
FROM twitter_match_candidates
WHERE telegram_user_id = %s
AND needs_llm_review = TRUE
AND llm_processed = FALSE
ORDER BY baseline_confidence DESC
""", (telegram_user_id,))
return [dict(row) for row in cur.fetchall()]
def get_user_chat_context(user_id: int, telegram_conn) -> Dict:
"""Get chat participation context for a user"""
with telegram_conn.cursor(cursor_factory=RealDictCursor) as cur:
cur.execute("""
SELECT
STRING_AGG(DISTINCT c.title, ' | ') FILTER (WHERE c.title IS NOT NULL) as chat_titles,
COUNT(DISTINCT cp.chat_id) as chat_count
FROM chat_participants cp
JOIN chats c ON cp.chat_id = c.chat_id
WHERE cp.user_id = %s
AND c.title IS NOT NULL
AND c.chat_type != 'private'
""", (user_id,))
result = cur.fetchone()
if result and result['chat_titles']:
# Check if chats indicate crypto/web3 interest
chat_titles_lower = result['chat_titles'].lower()
crypto_keywords = ['crypto', 'bitcoin', 'eth', 'defi', 'nft', 'dao', 'web3', 'blockchain',
'solana', 'near', 'avalanche', 'polygon', 'base', 'arbitrum', 'optimism',
'cosmos', 'builders', 'degen', 'lobster']
is_crypto_focused = any(keyword in chat_titles_lower for keyword in crypto_keywords)
return {
'chat_titles': result['chat_titles'],
'chat_count': result['chat_count'],
'is_crypto_focused': is_crypto_focused
}
return {'chat_titles': None, 'chat_count': 0, 'is_crypto_focused': False}
async def verify_telegram_user(telegram_user_id: int, verifier: LLMVerifier, checkpoint_manager: CheckpointManager,
telegram_db, telegram_conn, log_file=None) -> Dict:
"""Verify all candidates for a single Telegram user"""
# Get Telegram profile
telegram_profile = telegram_db.query(Contact).filter(
Contact.user_id == telegram_user_id
).first()
if not telegram_profile:
return {'matches_saved': 0, 'tokens': 0, 'cost': 0}
# Construct display name from first_name + last_name
display_name = (telegram_profile.first_name or '') + (' ' + telegram_profile.last_name if telegram_profile.last_name else '')
display_name = display_name.strip() or None
# Get chat participation context
chat_context = get_user_chat_context(telegram_user_id, telegram_conn)
telegram_dict = {
'user_id': telegram_profile.user_id,
'account_id': telegram_profile.account_id,
'username': telegram_profile.username,
'name': display_name,
'bio': telegram_profile.bio,
'chat_context': chat_context
}
# Get all candidates
candidates = get_candidates_for_telegram_user(telegram_user_id, telegram_conn)
if not candidates:
return {'matches_saved': 0, 'tokens': 0, 'cost': 0}
# Verify with LLM
semaphore = asyncio.Semaphore(1) # One at a time for now
llm_result = await verifier.verify_batch(telegram_dict, candidates, semaphore, log_file)
if not llm_result['success']:
# Mark as processed even on error so we don't retry infinitely
mark_candidates_processed([c['id'] for c in candidates], telegram_conn)
return {
'matches_saved': 0,
'tokens': llm_result['tokens_used'],
'cost': llm_result['cost']
}
# Save matches
matches_saved = save_verified_matches(
telegram_dict,
candidates,
llm_result['results'],
telegram_conn
)
# Mark candidates as processed
mark_candidates_processed([c['id'] for c in candidates], telegram_conn)
return {
'matches_saved': matches_saved,
'tokens': llm_result['tokens_used'],
'cost': llm_result['cost']
}
def save_verified_matches(telegram_profile: Dict, candidates: List[Dict], llm_results: List[Dict], telegram_conn):
"""Save verified matches with confidence scores"""
matches_to_save = []
# CRITICAL: Post-process to fix generic name false positives
tg_name = telegram_profile.get('name') or ''  # 'name' may be present but None
tg_username = telegram_profile.get('username') or ''
tg_bio = telegram_profile.get('bio') or ''
# Check if profile has generic characteristics
has_generic_name = len(tg_name) <= 7 # Short display name
has_generic_username = len(tg_username) <= 8 # Short username
has_empty_bio = not tg_bio or len(tg_bio) <= 20
is_generic_profile = has_empty_bio and (has_generic_name or has_generic_username)
for llm_result in llm_results:
candidate_idx = llm_result.get('candidate_index', 0) - 1 # Convert to 0-indexed
if candidate_idx < 0 or candidate_idx >= len(candidates):
continue
confidence = llm_result.get('confidence', 0)
reasoning = llm_result.get('reasoning', '')
# Only save if confidence >= 0.5 (moderate or higher)
if confidence < 0.5:
continue
candidate = candidates[candidate_idx]
# CRITICAL FIX: Cap confidence for generic profiles with weak match methods
# Weak match methods are those based purely on name/username containment
weak_match_methods = [
'display_name_containment',
'fuzzy_name',
'tg_username_in_twitter_name',
'twitter_username_in_tg_name'
]
if is_generic_profile and candidate['match_method'] in weak_match_methods:
# Cap at 0.70 unless it's a strong signal (phash, exact_username, exact_bio_handle)
if confidence > 0.70:
confidence = 0.70
reasoning += " [Confidence capped at 0.70: generic profile + weak match method]"
match_details = {
'match_method': candidate['match_method'],
'baseline_confidence': candidate['baseline_confidence'],
'llm_confidence': confidence,
'llm_reasoning': reasoning
}
matches_to_save.append((
telegram_profile['account_id'],
telegram_profile['user_id'],
telegram_profile.get('username'),
telegram_profile.get('name'),
telegram_profile.get('bio'),
candidate['twitter_id'],
candidate['twitter_username'],
candidate.get('twitter_name'),
candidate.get('twitter_bio'),
candidate.get('twitter_location'),
candidate.get('twitter_verified', False),
candidate.get('twitter_blue_verified', False),
candidate.get('twitter_followers_count', 0),
candidate['match_method'],
candidate['baseline_confidence'],
'CONFIDENT' if confidence >= 0.8 else 'MODERATE' if confidence >= 0.6 else 'UNSURE',
confidence,
json.dumps(match_details),
llm_result.get('tokens_used', 0),
0, # cost will be aggregated at the batch level
confidence < 0.75 # needs_manual_review
))
if matches_to_save:
with telegram_conn.cursor() as cur:
cur.executemany("""
INSERT INTO twitter_telegram_matches (
account_id,
telegram_user_id,
telegram_username,
telegram_name,
telegram_bio,
twitter_id,
twitter_username,
twitter_name,
twitter_bio,
twitter_location,
twitter_verified,
twitter_blue_verified,
twitter_followers_count,
match_method,
baseline_confidence,
llm_verdict,
final_confidence,
match_details,
llm_tokens_used,
llm_cost,
needs_manual_review
) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (telegram_user_id, twitter_id) DO UPDATE SET
llm_verdict = EXCLUDED.llm_verdict,
final_confidence = EXCLUDED.final_confidence,
matched_at = NOW()
""", matches_to_save)
telegram_conn.commit()
return len(matches_to_save)
def mark_candidates_processed(candidate_ids: List[int], telegram_conn):
"""Mark candidates as LLM processed"""
with telegram_conn.cursor() as cur:
cur.execute("""
UPDATE twitter_match_candidates
SET llm_processed = TRUE
WHERE id = ANY(%s)
""", (candidate_ids,))
telegram_conn.commit()

async def main():
    print()
    print("=" * 70)
    print("🤖 Twitter-Telegram Match Verifier V2 (Confidence-Based)")
    print("=" * 70)
    print()

    # Check arguments
    test_mode = '--test' in sys.argv
    verbose = '--verbose' in sys.argv
    limit = 100 if test_mode else None

    # Check for concurrency argument (--concurrent N)
    concurrent_requests = 10  # Default
    for i, arg in enumerate(sys.argv):
        if arg == '--concurrent' and i + 1 < len(sys.argv):
            try:
                concurrent_requests = int(sys.argv[i + 1])
            except ValueError:
                pass

    # Set up verbose logging
    log_file = None
    if verbose:
        log_file = open(Path(__file__).parent.parent / 'verification_v2_log.txt', 'w')
        log_file.write("=" * 100 + "\n")
        log_file.write("VERIFICATION V2 LOG\n")
        log_file.write("=" * 100 + "\n\n")

    if test_mode:
        print("🧪 TEST MODE: Processing first 100 Telegram users only")
        print()

    # Initialize checkpoint
    checkpoint_manager = CheckpointManager(CHECKPOINT_FILE)
    if checkpoint_manager.data['last_processed_telegram_id']:
        print(f"📍 Resuming from telegram_user_id {checkpoint_manager.data['last_processed_telegram_id']}")
        print(f"   Already processed: {checkpoint_manager.data['processed_count']:,}")
        print(f"   Cost so far: ${checkpoint_manager.data['total_cost']:.4f}")
        print()
    else:
        checkpoint_manager.data['started_at'] = datetime.now().isoformat()

    # Connect to databases
    print("📡 Connecting to databases...")
    telegram_db = SessionLocal()
    try:
        telegram_conn = psycopg2.connect(dbname='telegram_contacts', user='andrewjiang', host='localhost', port=5432)
        telegram_conn.autocommit = False
    except Exception as e:
        print(f"❌ Failed to connect to Telegram database: {e}")
        return False
    print("✅ Connected")
    print()

    try:
        # Load telegram users needing verification
        print("🔍 Loading telegram users with candidates...")
        telegram_user_ids = get_telegram_users_with_candidates(telegram_conn, checkpoint_manager, limit)
        if not telegram_user_ids:
            print("✅ No users to verify!")
            return True
        print(f"✅ Found {len(telegram_user_ids):,} telegram users to verify")
        print()

        # Estimate cost (rough, ~$0.003 per user)
        estimated_cost = len(telegram_user_ids) * 0.003
        print(f"💰 Estimated cost: ${estimated_cost:.4f}")
        print()

        # Initialize verifier
        verifier = LLMVerifier()
        print("🚀 Starting LLM verification...")
        print(f"⚡ Concurrent requests: {concurrent_requests}")
        print()

        # Configuration for parallel processing
        CONCURRENT_REQUESTS = concurrent_requests  # Max in-flight requests at a time
        BATCH_SIZE = 50  # Save checkpoint every 50 users

        # Process users in parallel batches
        total_users = len(telegram_user_ids)
        processed_count = 0

        for batch_start in range(0, total_users, BATCH_SIZE):
            batch_end = min(batch_start + BATCH_SIZE, total_users)
            batch = telegram_user_ids[batch_start:batch_end]
            print(f"📦 Processing batch {batch_start//BATCH_SIZE + 1}/{(total_users + BATCH_SIZE - 1)//BATCH_SIZE} ({len(batch)} users)...")

            # Create coroutines for concurrent processing (not started until awaited)
            tasks = []
            for telegram_user_id in batch:
                task = verify_telegram_user(
                    telegram_user_id,
                    verifier,
                    checkpoint_manager,
                    telegram_db,
                    telegram_conn,
                    log_file
                )
                tasks.append((telegram_user_id, task))

            # Process batch concurrently, capped by a semaphore
            semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)

            async def process_with_semaphore(user_id, task):
                async with semaphore:
                    return user_id, await task

            results = await asyncio.gather(
                *[process_with_semaphore(user_id, task) for user_id, task in tasks],
                return_exceptions=True
            )

            # Process results and update checkpoints (gather preserves input order)
            for i, result in enumerate(results):
                processed_count += 1
                user_id = batch[i]
                if isinstance(result, Exception):
                    print(f"[{processed_count}/{total_users}] ❌ User {user_id} failed: {result}")
                    continue
                user_id_result, verification_result = result
                print(f"[{processed_count}/{total_users}] ✅ User {user_id_result}: {verification_result['matches_saved']} matches | ${verification_result['cost']:.4f}")

                # Update checkpoint
                checkpoint_manager.update(
                    user_id_result,
                    verification_result['matches_saved'],
                    verification_result['tokens'],
                    verification_result['cost']
                )

            print(f"   Batch complete. Total processed: {processed_count}/{total_users}")
            print()

        # Final stats
        print()
        print("=" * 70)
        print("✅ VERIFICATION COMPLETE")
        print("=" * 70)
        print()
        print("📊 Statistics:")
        print(f"   Processed: {checkpoint_manager.data['processed_count']:,} telegram users")
        print(f"   💾 Saved matches: {checkpoint_manager.data['total_matches_saved']:,}")
        print()
        print("💰 Cost:")
        print(f"   Total tokens: {checkpoint_manager.data['total_tokens']:,}")
        print(f"   Total cost: ${checkpoint_manager.data['total_cost']:.4f}")
        print()

        # Clean up checkpoint after a successful full run
        CHECKPOINT_FILE.unlink(missing_ok=True)
        return True

    except Exception as e:
        print(f"❌ Error: {e}")
        import traceback
        traceback.print_exc()
        return False

    finally:
        if log_file:
            log_file.close()
        telegram_db.close()
        telegram_conn.close()
        checkpoint_manager.save()

if __name__ == "__main__":
    try:
        success = asyncio.run(main())
        sys.exit(0 if success else 1)
    except KeyboardInterrupt:
        print("\n\n⚠️  Interrupted by user")
        print("💾 Progress saved - you can resume by running this script again")
        sys.exit(1)