# Twitter-Telegram Profile Matching System

A comprehensive system for finding and verifying Twitter-Telegram profile matches using multiple matching methods and LLM-based verification.

## Overview

This system operates in two main steps:

1. **Candidate Finding**: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods
2. **LLM Verification**: Uses GPT to evaluate candidates and assign confidence scores (0.70-1.0)

## Quick Start

```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```

## Matching Methods

The system uses 10 different methods to find Twitter candidates:

### High Confidence Methods (0.90-0.95)

1. **Phash Match** (0.95 for exact, 0.88 for distance=1)
   - Compares profile picture hashes
   - Pre-computed in `telegram_twitter_phash_matches` table

2. **Exact Bio Handle** (0.95)
   - Extracts Twitter handles from Telegram bio
   - Patterns: `@username`, `twitter.com/username`, `x.com/username`

3. **Bio URL Resolution** (0.95) ⭐ NEW
   - Twitter bio contains shortened URL (t.co/xyz) that resolves to `t.me/username`
   - Queries `url_resolution_queue` table
   - Captures matches even when usernames differ

4. **Twitter Bio Has Telegram** (0.92)
   - Reverse lookup: Twitter bio mentions Telegram username
   - Searches for `@username`, `t.me/username`, `telegram.me/username`

5. **Display Name Containment** (0.92)
   - Telegram name contained within Twitter display name

6. **Exact Username** (0.90)
   - Telegram username exactly matches Twitter username

### Medium Confidence Methods (0.80-0.88)

7. **TG Username in Twitter Name** (0.88)

8. **Twitter Username in TG Name** (0.86)

9. **Fuzzy Name** (0.65-0.85)
   - PostgreSQL trigram similarity with 0.65 threshold

10. **Username Variation** (0.80)
    - Generates variations (remove underscores, flip numbers, etc.)

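
The handle patterns listed for the Exact Bio Handle method (`@username`, `twitter.com/username`, `x.com/username`) could be matched with a single regular expression. The following is a minimal sketch, not the code in `find_twitter_candidates.py`: the function name `extract_twitter_handles` and the regex itself are hypothetical, and it assumes Twitter handles are at most 15 letters, digits, or underscores.

```python
import re

# Hypothetical pattern (not taken from the project) covering the three forms
# listed above. Lookarounds keep email addresses (me@example.com) and other
# domains ending in "x.com" (dropbox.com) from producing false hits.
_HANDLE_PATTERN = re.compile(
    r"(?:(?<![A-Za-z0-9-])(?:www\.)?(?:twitter|x)\.com/@?"  # profile URL forms
    r"|(?<![A-Za-z0-9_.])@)"                                # bare @username form
    r"([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])",                # assumed handle format
    re.IGNORECASE,
)


def extract_twitter_handles(bio):
    """Return lowercase candidate handles from a bio, deduplicated, in order."""
    handles = []
    for match in _HANDLE_PATTERN.finditer(bio or ""):
        handle = match.group(1).lower()
        if handle not in handles:
            handles.append(handle)
    return handles


print(extract_twitter_handles("Builder. twitter: @Alice_01 | https://x.com/bob"))
# → ['alice_01', 'bob']
```

Handles found this way would still need to be looked up against the Twitter profile data before being stored as candidates.
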
## LLM Verification

The system uses GPT-5-mini with a sophisticated V6 prompt that:

- Evaluates ALL candidates together (comparative evaluation)
- Applies differential scoring (only one can be "most likely")
- Distinguishes between personal and company accounts
- Considers signal strength holistically
- Only saves matches with 70%+ confidence

## Files

### Core Scripts

- `main.py` - Interactive menu for running the system
- `find_twitter_candidates.py` - Core matching logic (TwitterMatcher class)
- `find_twitter_candidates_threaded.py` - Threaded implementation (RECOMMENDED)
- `verify_twitter_matches_v2.py` - LLM verification with async (RECOMMENDED)
- `review_match_quality.py` - Analyze match quality and statistics

### Database Schema

- `setup_twitter_matching_schema.sql` - Database tables and indexes

## Database Tables

### `twitter_match_candidates`

Stores all potential matches found by the matching methods.

**Key fields:**
- `telegram_user_id` - Telegram contact user ID
- `twitter_id` - Twitter profile ID
- `match_method` - Which method found this candidate
- `baseline_confidence` - Initial confidence (0.0-1.0)
- `match_signals` - JSON with match details
- `llm_processed` - Whether LLM has evaluated this candidate

### `twitter_telegram_matches`

Stores verified matches (70%+ confidence from LLM).

**Key fields:**
- `telegram_user_id` - Telegram contact
- `twitter_id` - Matched Twitter profile
- `final_confidence` - LLM-assigned confidence (0.70-1.0)
- `llm_verdict` - LLM reasoning
- `match_method` - Original matching method
- `matched_at` - Timestamp

### `url_resolution_queue`

Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).

**Key fields:**
- `twitter_id` - Twitter profile ID
- `original_url` - Shortened URL (e.g., t.co/abc)
- `resolved_url` - Full URL (e.g., https://t.me/username)
- `telegram_handles` - Extracted Telegram handles (JSONB array)

## Usage Examples

### Find Candidates for All Contacts (Threaded)

```bash
python3.10 find_twitter_candidates_threaded.py --workers 8
```

### Find Candidates for First 1000 Contacts

```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```

### Verify Matches with LLM (100 concurrent requests)

```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```

### Test Mode (50 users, 10 concurrent)

```bash
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10
```

### Review Match Quality

```bash
python3.10 review_match_quality.py
```

## Performance

### Candidate Finding (Threaded)

- **Speed**: ~1.5 contacts/sec
- **Time for 43K contacts**: ~16-18 hours
- **Workers**: 8 (default, configurable)

### LLM Verification (Async)

- **Speed**: ~32 users/minute with 100 concurrent requests
- **Cost**: ~$0.003 per user (GPT-5-mini)
- **Time for 43K users**: ~23 hours

## Recent Improvements

### V6 Prompt (Latest)

- Upfront directive for comparative evaluation
- Clear signal strength hierarchy
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer

### URL Resolution Integration

- Added Method 5b: Bio URL resolution
- Captures 140+ additional matches
- Especially valuable when usernames differ
- 0.95 baseline confidence (very high)

## Configuration

Environment variables (in `/Users/andrewjiang/Bao/TimeToLockIn/Profile/.env`):

```
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini
```

Database connections:

- `telegram_contacts` - Telegram contact data
- `twitter_data` - Twitter profile data

## Tips

1. **Always run threaded candidate finding** - 10-20x faster than single-threaded
2. **Use high concurrency for LLM verification** - 100+ concurrent requests for optimal speed
3. **Monitor costs** - Check OpenAI usage during verification
4. **Review match quality periodically** - Use `review_match_quality.py` to analyze results
5. **Test first** - Use `--test --limit 50` flags before full runs

## Troubleshooting

### LLM verification is slow

- Increase `--concurrent` parameter (try 100-200)
- Check OpenAI rate limits (1,000 RPM for Tier 1)

### Many low-quality matches

- Review and adjust V6 prompt in `verify_twitter_matches_v2.py`
- Check `review_match_quality.py` for insights

### Missing obvious matches

- Check if candidate was found: Query `twitter_match_candidates`
- If not found, may need new matching method
- If found but not verified, check LLM reasoning in `llm_verdict`

## Future Enhancements

- Add more matching methods (location, bio keywords, etc.)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats
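
For reference, the `--concurrent` option discussed under Tips and Troubleshooting caps how many LLM requests are in flight at once. One common way to get that behavior in async Python is an `asyncio.Semaphore`. The sketch below illustrates the pattern only and is not the code in `verify_twitter_matches_v2.py`: `verify_all`, `verify_one`, and `fake_verify` are hypothetical names, and the stand-in sleeps instead of calling an LLM.

```python
import asyncio


async def verify_all(user_ids, verify_one, concurrent=100):
    """Await verify_one(user_id) for every user, at most `concurrent` at a time."""
    semaphore = asyncio.Semaphore(concurrent)

    async def bounded(user_id):
        # Each task waits here until one of the `concurrent` slots is free.
        async with semaphore:
            return await verify_one(user_id)

    # gather() preserves input order in the returned list.
    return await asyncio.gather(*(bounded(u) for u in user_ids))


async def demo(total=20, concurrent=5):
    """Run the pattern against a stand-in coroutine and record peak concurrency."""
    in_flight = 0
    peak = 0

    async def fake_verify(user_id):  # hypothetical stand-in for an LLM request
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return user_id

    results = await verify_all(range(total), fake_verify, concurrent=concurrent)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(len(results), "users processed; peak in-flight requests:", peak)
```

A semaphore limits simultaneous requests but not requests per minute, so a high `--concurrent` value can still run into the provider's RPM limit.
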