# ProfileMatching Changelog

## 2025-11-04 - Initial Module Creation & URL Resolution Integration

### Module Organization

- Created dedicated `ProfileMatching/` folder within UnifiedContacts
- Bundled all matching and verification scripts together
- Added comprehensive interactive `main.py` with menu system
- Added detailed `README.md` documentation

### Major Enhancement: URL Resolution Integration

**Problem**: Missing matches where Twitter usernames differ from Telegram usernames, even when Twitter profile explicitly links to Telegram.

**Solution**: Integrated `url_resolution_queue` table data into candidate finding.

**Implementation**:

- Added new method `find_by_resolved_url()` to TwitterMatcher class (find_twitter_candidates.py:339-378)
- Queries Twitter profiles where bio URLs (t.co/xyz) resolve to t.me/{username}
- Integrated as Method 5b in candidate finding pipeline (find_twitter_candidates.py:554-574)
- Baseline confidence: 0.95 (very high - explicit link in bio)

**Impact**:

- 140+ potential new matches identified
- Captures matches like Twitter @Block_Flash → Telegram @bull_flash
- Especially valuable when usernames differ but user explicitly links profiles

**Example Match**:

```
Twitter: @Block_Flash (ID: 63429948)
Telegram: @bull_flash
Method: twitter_bio_url_resolves_to_telegram
Confidence: 0.95
Resolved URL: https://t.me/bull_flash
Original URL: https://t.co/dc3iztSG9B
```

### URL Resolution Data Source

The `url_resolution_queue` table contains:

- 16,133 Twitter profiles with resolved Telegram URLs
- Shortened URLs from Twitter bios (t.co/xyz)
- Resolved destinations (full URLs)
- Extracted Telegram handles (JSONB array)

### Files Modified

1. **find_twitter_candidates.py**
   - Added `find_by_resolved_url()` method
   - Integrated into `find_candidates_for_contact()` as Method 5b
   - Fixed type casting for twitter_id (VARCHAR) to user.id (BIGINT) join

2. **main.py** (ProfileMatching module)
   - Created comprehensive interactive menu
   - Real-time statistics display
   - Streamlined workflow for candidate finding and LLM verification

3. **README.md**
   - Complete documentation of all 10 matching methods
   - Usage examples and performance metrics
   - Configuration and troubleshooting guides

4. **UnifiedContacts/main.py**
   - Added option 15: "Open Profile Matching System (Interactive Menu)"
   - Reorganized menu to separate ProfileMatching from individual Twitter steps

### Current Statistics (as of 2025-11-04)

**Candidates**:

- Users with candidates: 38,121
- Total candidates found: 253,117
- Processed by LLM: 253,085
- Pending verification: 32

**Verified Matches**:

- Users with matches: 25,662
- Total matches: 36,147
- Average confidence: 0.74
- High confidence (90%+): 12,031
- Medium confidence (80-89%): 1,452
- Low confidence (70-79%): 7,505

### LLM Verification (V6 Prompt)

Current prompt improvements:

- Upfront directive for comparative evaluation
- Clear signal strength hierarchy (Very Strong, Strong Supporting, Weak, Red Flags)
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer
- Emphasis on evaluating ALL candidates together

### Performance Metrics

**Candidate Finding (Threaded)**:

- Speed: ~1.5 contacts/sec
- Time for 43K contacts: ~16-18 hours
- Workers: 8 (default)

**LLM Verification (Async)**:

- Speed: ~32 users/minute (100 concurrent requests)
- Cost: ~$0.003 per user (GPT-5-mini)
- Time for 43K users: ~23 hours

### Module Structure

```
ProfileMatching/
├── main.py                              # Interactive menu system
├── README.md                            # Complete documentation
├── CHANGELOG.md                         # This file
├── find_twitter_candidates.py           # Core matching logic (10 methods)
├── find_twitter_candidates_threaded.py  # Threaded implementation
├── verify_twitter_matches_v2.py         # LLM verification (V6 prompt)
├── review_match_quality.py              # Analysis tools
└── setup_twitter_matching_schema.sql    # Database schema
```

### 10 Matching Methods Summary

1. **Phash Match** (0.95/0.88) - Profile picture similarity
2. **Exact Bio Handle** (0.95) - Twitter handle extracted from Telegram bio
3. **Bio URL Resolution** (0.95) ⭐ NEW - Shortened URL resolves to Telegram
4. **Twitter Bio Has Telegram** (0.92) - Twitter bio mentions Telegram username
5. **Display Name Containment** (0.92) - TG name in TW name
6. **Exact Username** (0.90) - Usernames match exactly
7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85) - Trigram similarity
10. **Username Variation** (0.80) - Generated username variations

### Testing

All changes tested with:

- Standalone method testing (find_by_resolved_url)
- Full integration testing (find_candidates_for_contact)
- Verified deduplication works correctly
- Confirmed matches with different usernames are captured

### Next Steps

Potential future enhancements:

- Add more matching methods (location, bio keywords, mutual connections)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats
- Additional URL resolution sources beyond Twitter bios

### Migration Notes

**For existing deployments**:

1. No database schema changes required
2. Existing `url_resolution_queue` table is used as-is
3. Scripts in `scripts/` folder remain unchanged and functional
4. New ProfileMatching module is additive, doesn't break existing workflows

**To use new features**:

1. Use ProfileMatching/main.py instead of individual scripts
2. Or run scripts directly from ProfileMatching folder
3. Or update import paths to use ProfileMatching module
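
### Illustrative Sketch: Resolved URL Matching

The URL resolution method described under "Major Enhancement" can be illustrated with a minimal in-memory sketch. This is not the code in `find_twitter_candidates.py`, which queries the `url_resolution_queue` table. The `ResolvedUrl` and `Candidate` types, their field names, the regular expression, and the function name `find_by_resolved_url_sketch` are all assumptions made for illustration. Only the method label, the 0.95 baseline confidence, and the sample values are taken from this changelog entry.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumed pattern: a profile link of the form t.me/<username>, optionally with
# a trailing slash or query string. Links with extra path segments
# (for example t.me/joinchat/...) are deliberately not matched.
_TG_PROFILE_URL = re.compile(
    r"^https?://(?:www\.)?(?:t\.me|telegram\.me)/([A-Za-z0-9_]+)/?(?:\?.*)?$"
)

METHOD = "twitter_bio_url_resolves_to_telegram"
BASELINE_CONFIDENCE = 0.95


@dataclass(frozen=True)
class ResolvedUrl:
    """Hypothetical in-memory stand-in for one url_resolution_queue row."""
    twitter_id: str          # kept as text, mirroring the VARCHAR column
    twitter_username: str
    original_url: str        # shortened URL found in the Twitter bio
    resolved_url: str        # final destination after redirects


@dataclass(frozen=True)
class Candidate:
    twitter_id: int
    twitter_username: str
    method: str
    confidence: float
    resolved_url: str
    original_url: str


def extract_telegram_handle(url: str) -> Optional[str]:
    """Return the lowercased Telegram username from a t.me URL, else None."""
    match = _TG_PROFILE_URL.match(url.strip())
    return match.group(1).lower() if match else None


def find_by_resolved_url_sketch(
    telegram_username: str, rows: Iterable[ResolvedUrl]
) -> List[Candidate]:
    """Find Twitter profiles whose bio URL resolves to the given Telegram user."""
    target = telegram_username.lstrip("@").lower()
    seen_ids = set()
    candidates = []
    for row in rows:
        if extract_telegram_handle(row.resolved_url) != target:
            continue
        twitter_id = int(row.twitter_id)  # text id -> integer, as in the join fix
        if twitter_id in seen_ids:        # deduplicate repeated bio links
            continue
        seen_ids.add(twitter_id)
        candidates.append(
            Candidate(twitter_id, row.twitter_username, METHOD,
                      BASELINE_CONFIDENCE, row.resolved_url, row.original_url)
        )
    return candidates


# Usage with the example match from this entry.
example_rows = [
    ResolvedUrl("63429948", "Block_Flash",
                "https://t.co/dc3iztSG9B", "https://t.me/bull_flash"),
]
for c in find_by_resolved_url_sketch("@bull_flash", example_rows):
    print(c.twitter_username, c.method, c.confidence)
# prints: Block_Flash twitter_bio_url_resolves_to_telegram 0.95
```

Comparing handles case-insensitively and skipping multi-segment paths are design choices of this sketch; the actual method may treat invite links or channel previews differently.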