This module provides comprehensive Twitter-to-Telegram profile matching and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Twitter-Telegram Profile Matching System
A comprehensive system for finding and verifying Twitter-Telegram profile matches using multiple matching methods and LLM-based verification.
Overview
This system operates in two main steps:
- Candidate Finding: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods
- LLM Verification: Uses GPT to evaluate candidates and assign confidence scores; only matches scoring 0.70-1.0 are saved
Quick Start
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
Matching Methods
The system uses 10 different methods to find Twitter candidates:
High Confidence Methods (0.90-0.95)
- Phash Match (0.95 for exact, 0.88 for distance=1)
  - Compares profile picture hashes
  - Pre-computed in the telegram_twitter_phash_matches table
- Exact Bio Handle (0.95)
  - Extracts Twitter handles from Telegram bio
  - Patterns: @username, twitter.com/username, x.com/username
- Bio URL Resolution (0.95) ⭐ NEW
  - Twitter bio contains shortened URL (t.co/xyz) that resolves to t.me/username
  - Queries the url_resolution_queue table
  - Captures matches even when usernames differ
- Twitter Bio Has Telegram (0.92)
  - Reverse lookup: Twitter bio mentions Telegram username
  - Searches for @username, t.me/username, telegram.me/username
- Display Name Containment (0.92)
  - Telegram name contained within Twitter display name
- Exact Username (0.90)
  - Telegram username exactly matches Twitter username
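The Exact Bio Handle method above depends on pulling Twitter handles out of free-form bio text. The following is a minimal sketch of that extraction, not the code in find_twitter_candidates.py: the regular expression, the constant name, and the function name are all assumptions made for illustration, covering only the three patterns listed (@username, twitter.com/username, x.com/username).

```python
import re

# Hypothetical pattern (an assumption, not the project's actual regex).
# Twitter usernames are 1-15 characters of letters, digits, or underscores.
TWITTER_HANDLE_RE = re.compile(
    r"(?:(?<![\w.])@"                                # @username, but not the @ of an email
    r"|(?:twitter\.com|(?<![A-Za-z0-9_-])x\.com)/)"  # or a twitter.com/ or x.com/ URL
    r"([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])",         # the handle, not cut from a longer word
    re.IGNORECASE,
)


def extract_twitter_handles(bio: str) -> list[str]:
    """Return unique, lowercased handles in order of first appearance."""
    handles: list[str] = []
    for match in TWITTER_HANDLE_RE.finditer(bio or ""):
        handle = match.group(1).lower()
        if handle not in handles:
            handles.append(handle)
    return handles
```

A bare @username in a Telegram bio may refer to a Telegram account rather than a Twitter one, and URL paths such as twitter.com/home are not profiles, so a real implementation would likely need extra filtering before treating a hit as a 0.95 candidate.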
Medium Confidence Methods (0.65-0.88)
- TG Username in Twitter Name (0.88)
- Twitter Username in TG Name (0.86)
- Fuzzy Name (0.65-0.85)
  - PostgreSQL trigram similarity with 0.65 threshold
- Username Variation (0.80)
  - Generates variations (remove underscores, flip numbers, etc.)
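To give a feel for what the 0.65 trigram threshold means in the Fuzzy Name method, here is a rough pure-Python approximation of PostgreSQL's pg_trgm similarity. It is only a sketch: the real system runs this inside the database, the function names are illustrative, and this version handles ASCII letters and digits only. It follows the documented pg_trgm convention of lowercasing, padding each word with two leading spaces and one trailing space, and dividing shared trigrams by total distinct trigrams.

```python
import re

FUZZY_THRESHOLD = 0.65  # threshold quoted for the Fuzzy Name method


def trigrams(text: str) -> set[str]:
    """Collect the padded 3-character substrings of every word in text."""
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_fuzzy_match(a: str, b: str) -> bool:
    return trigram_similarity(a, b) >= FUZZY_THRESHOLD
```

Under this approximation, short names are sensitive to single-character differences: "John Smith" and "Jon Smith" share 8 of 13 distinct trigrams, which falls just below 0.65.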
LLM Verification
The system uses GPT-5-mini with a sophisticated V6 prompt that:
- Evaluates ALL candidates together (comparative evaluation)
- Applies differential scoring (only one can be "most likely")
- Distinguishes between personal and company accounts
- Considers signal strength holistically
- Only saves matches with 70%+ confidence
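The final rule above (save only matches at 70% confidence or higher) can be illustrated with a small filter. This is a sketch under assumptions: the input shape (a mapping from Twitter ID to LLM confidence) and the names MIN_CONFIDENCE and select_matches are hypothetical and are not taken from verify_twitter_matches_v2.py.

```python
MIN_CONFIDENCE = 0.70  # save threshold stated above


def select_matches(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Keep candidates at or above the threshold, strongest first.

    `scores` maps a Twitter ID to the confidence the LLM assigned it
    (hypothetical structure, for illustration only).
    """
    kept = [(twitter_id, score) for twitter_id, score in scores.items()
            if score >= MIN_CONFIDENCE]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Because the prompt scores all candidates for a contact together, at most one of them would normally be expected to sit near the top of this list.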
Files
Core Scripts
- main.py - Interactive menu for running the system
- find_twitter_candidates.py - Core matching logic (TwitterMatcher class)
- find_twitter_candidates_threaded.py - Threaded implementation (RECOMMENDED)
- verify_twitter_matches_v2.py - LLM verification with async (RECOMMENDED)
- review_match_quality.py - Analyze match quality and statistics
Database Schema
- setup_twitter_matching_schema.sql - Database tables and indexes
Database Tables
twitter_match_candidates
Stores all potential matches found by the matching methods.
Key fields:
- telegram_user_id - Telegram contact user ID
- twitter_id - Twitter profile ID
- match_method - Which method found this candidate
- baseline_confidence - Initial confidence (0.0-1.0)
- match_signals - JSON with match details
- llm_processed - Whether LLM has evaluated this candidate
twitter_telegram_matches
Stores verified matches (70%+ confidence from LLM).
Key fields:
- telegram_user_id - Telegram contact
- twitter_id - Matched Twitter profile
- final_confidence - LLM-assigned confidence (0.70-1.0)
- llm_verdict - LLM reasoning
- match_method - Original matching method
- matched_at - Timestamp
url_resolution_queue
Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).
Key fields:
- twitter_id - Twitter profile ID
- original_url - Shortened URL (e.g., t.co/abc)
- resolved_url - Full URL (e.g., https://t.me/username)
- telegram_handles - Extracted Telegram handles (JSONB array)
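As an illustration of how a resolved_url could be turned into an entry for telegram_handles, the sketch below parses a URL and returns the username when the host is t.me or telegram.me. The function name, the reserved-path list, and the validation rule are assumptions for this example; the actual extraction logic is not shown in this README.

```python
import re
from typing import Optional
from urllib.parse import urlparse

TELEGRAM_HOSTS = {"t.me", "telegram.me"}
# Non-exhaustive list of first path segments that are not usernames (assumption).
RESERVED_PATHS = {"joinchat", "addstickers", "share"}


def telegram_handle_from_url(resolved_url: str) -> Optional[str]:
    """Return the lowercased Telegram username in a t.me-style URL, or None."""
    if "://" not in resolved_url:
        resolved_url = f"https://{resolved_url}"  # tolerate scheme-less links
    parsed = urlparse(resolved_url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in TELEGRAM_HOSTS:
        return None
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        return None
    handle = segments[0].lower()
    # Letters, digits, and underscores only; length is not enforced here.
    if handle in RESERVED_PATHS or not re.fullmatch(r"[a-z0-9_]+", handle):
        return None
    return handle
```

Invite links such as t.me/+abc or t.me/joinchat/... return None here because they do not identify a username.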
Usage Examples
Find Candidates for All Contacts (Threaded)
python3.10 find_twitter_candidates_threaded.py --workers 8
Find Candidates for First 1000 Contacts
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
Verify Matches with LLM (100 concurrent requests)
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
Test Mode (50 users, 10 concurrent)
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10
Review Match Quality
python3.10 review_match_quality.py
Performance
Candidate Finding (Threaded)
- Speed: ~1.5 contacts/sec
- Time for 43K contacts: ~16-18 hours
- Workers: 8 (default, configurable)
LLM Verification (Async)
- Speed: ~32 users/minute with 100 concurrent requests
- Cost: ~$0.003 per user (GPT-5-mini)
- Time for 43K users: ~23 hours
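The figures above can be sanity-checked with simple arithmetic. The snippet below is only a back-of-the-envelope estimate using the rates quoted in this section; the variable names are illustrative.

```python
CONTACTS = 43_000

# Candidate finding at the quoted rate of ~1.5 contacts/sec
finding_hours = CONTACTS / 1.5 / 3600

# LLM verification at ~32 users/minute
verify_hours = CONTACTS / 32 / 60

# Cost at ~$0.003 per user
verify_cost_usd = CONTACTS * 0.003

print(f"candidate finding: {finding_hours:.1f} h")
print(f"LLM verification:  {verify_hours:.1f} h")
print(f"verification cost: ${verify_cost_usd:.0f}")
```

The verification numbers come out at roughly 22.4 hours and $129, in line with the figures above. The candidate-finding rate of ~1.5 contacts/sec would imply roughly 8 hours, which is well under the quoted 16-18 hours, so the sustained rate over a full run is presumably lower than the quoted figure. Treat both as rough estimates.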
Recent Improvements
V6 Prompt (Latest)
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer
URL Resolution Integration
- Added Method 5b: Bio URL resolution
- Captures 140+ additional matches
- Especially valuable when usernames differ
- 0.95 baseline confidence (very high)
Configuration
Environment variables (in /Users/andrewjiang/Bao/TimeToLockIn/Profile/.env):
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini
Database connections:
- telegram_contacts - Telegram contact data
- twitter_data - Twitter profile data
Tips
- Always run threaded candidate finding - 10-20x faster than single-threaded
- Use high concurrency for LLM verification - 100+ concurrent requests for optimal speed
- Monitor costs - Check OpenAI usage during verification
- Review match quality periodically - Use review_match_quality.py to analyze results
- Test first - Use --test --limit 50 flags before full runs
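For readers wondering how a --concurrent style limit typically works in async code, the sketch below caps the number of in-flight requests with asyncio.Semaphore. It is not the implementation in verify_twitter_matches_v2.py: verify_user is a stand-in that sleeps instead of calling the LLM, and all names here are hypothetical.

```python
import asyncio


async def verify_user(user_id: int) -> dict:
    """Placeholder for one LLM verification call (hypothetical)."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"user_id": user_id, "verified": True}


async def verify_all(user_ids, concurrent: int = 100) -> list[dict]:
    """Verify every user while keeping at most `concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(concurrent)

    async def bounded(user_id: int) -> dict:
        async with semaphore:  # waits here once the limit is reached
            return await verify_user(user_id)

    # gather returns results in the same order as the input IDs
    return await asyncio.gather(*(bounded(uid) for uid in user_ids))


results = asyncio.run(verify_all(range(50), concurrent=10))
```

Raising the limit helps only until the API's rate limit becomes the bottleneck, which is why a small test run is worth doing first.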
Troubleshooting
LLM verification is slow
- Increase the --concurrent parameter (try 100-200)
- Check OpenAI rate limits (1,000 RPM for Tier 1)
Many low-quality matches
- Review and adjust the V6 prompt in verify_twitter_matches_v2.py
- Run review_match_quality.py for insights
Missing obvious matches
- Check if the candidate was found: query twitter_match_candidates
- If not found, a new matching method may be needed
- If found but not verified, check the LLM reasoning in llm_verdict
Future Enhancements
- Add more matching methods (location, bio keywords, etc.)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats