ProfileMatching/CHANGELOG.md
Andrew Jiang 5319d4d868 Initial commit: Twitter-Telegram Profile Matching System
This module provides comprehensive Twitter-to-Telegram profile matching
and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-04 22:56:25 -08:00


# ProfileMatching Changelog
## 2025-11-04 - Initial Module Creation & URL Resolution Integration
### Module Organization
- Created dedicated `ProfileMatching/` folder within UnifiedContacts
- Bundled all matching and verification scripts together
- Added comprehensive interactive `main.py` with menu system
- Added detailed `README.md` documentation
### Major Enhancement: URL Resolution Integration
**Problem**: Matches were missed when a user's Twitter and Telegram usernames differ, even when the Twitter profile explicitly links to the Telegram account.
**Solution**: Integrated data from the `url_resolution_queue` table into candidate finding.
**Implementation**:
- Added new method `find_by_resolved_url()` to TwitterMatcher class (find_twitter_candidates.py:339-378)
- Queries Twitter profiles where bio URLs (t.co/xyz) resolve to t.me/{username}
- Integrated as Method 5b in candidate finding pipeline (find_twitter_candidates.py:554-574)
- Baseline confidence: 0.95 (very high - explicit link in bio)
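
The handle-extraction step behind this method can be sketched as follows. This is a minimal illustration, not the code in `find_twitter_candidates.py`: the function name `extract_telegram_handle`, the username pattern, and the reserved-path set are all assumptions made for the example.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed shape of a public Telegram username (hypothetical rule for this sketch).
_HANDLE_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]{3,31}$")
# t.me path prefixes that are not usernames (illustrative, not exhaustive).
_RESERVED = {"joinchat", "addstickers", "share"}
_TELEGRAM_HOSTS = {"t.me", "telegram.me"}


def extract_telegram_handle(resolved_url: str) -> Optional[str]:
    """Return the lowercased Telegram handle from a resolved URL, or None."""
    parsed = urlparse(resolved_url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in _TELEGRAM_HOSTS:
        return None
    # The handle, if any, is the first non-empty path segment.
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        return None
    first = segments[0]
    if first.lower() in _RESERVED or not _HANDLE_RE.match(first):
        return None
    return first.lower()


print(extract_telegram_handle("https://t.me/bull_flash"))
```

URLs without a scheme (for example `t.me/bull_flash`) are rejected by this sketch because `urlparse` does not report a hostname for them; a real pipeline would likely normalize those first.
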
**Impact**:
- 140+ potential new matches identified
- Captures matches like Twitter @Block_Flash → Telegram @bull_flash
- Especially valuable when usernames differ but the user explicitly links the two profiles
**Example Match**:
```
Twitter: @Block_Flash (ID: 63429948)
Telegram: @bull_flash
Method: twitter_bio_url_resolves_to_telegram
Confidence: 0.95
Resolved URL: https://t.me/bull_flash
Original URL: https://t.co/dc3iztSG9B
```
### URL Resolution Data Source
The `url_resolution_queue` table contains:
- 16,133 Twitter profiles with resolved Telegram URLs
- Shortened URLs from Twitter bios (t.co/xyz)
- Resolved destinations (full URLs)
- Extracted Telegram handles (JSONB array)
### Files Modified
1. **find_twitter_candidates.py**
   - Added `find_by_resolved_url()` method
   - Integrated into `find_candidates_for_contact()` as Method 5b
   - Fixed type casting in the join between `twitter_id` (VARCHAR) and `user.id` (BIGINT)
2. **main.py** (ProfileMatching module)
   - Created comprehensive interactive menu
   - Real-time statistics display
   - Streamlined workflow for candidate finding and LLM verification
3. **README.md**
   - Complete documentation of all 10 matching methods
   - Usage examples and performance metrics
   - Configuration and troubleshooting guides
4. **UnifiedContacts/main.py**
   - Added option 15: "Open Profile Matching System (Interactive Menu)"
   - Reorganized menu to separate ProfileMatching from individual Twitter steps
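
The type-casting fix in `find_twitter_candidates.py` concerns joining a text ID column against an integer one. The sketch below shows the general idea using an in-memory SQLite database as a stand-in so it runs anywhere. The table name `twitter_user` and the column names `original_url` and `resolved_url` are assumptions for illustration; which side the real query casts is also not stated here. In PostgreSQL the equivalent cast would be written `CAST(q.twitter_id AS BIGINT)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified schemas for illustration only.
    CREATE TABLE twitter_user (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE url_resolution_queue (
        twitter_id   TEXT,   -- stored as text, like the VARCHAR column
        original_url TEXT,
        resolved_url TEXT
    );
    INSERT INTO twitter_user VALUES (63429948, 'Block_Flash');
    INSERT INTO url_resolution_queue VALUES
        ('63429948', 'https://t.co/dc3iztSG9B', 'https://t.me/bull_flash');
""")

rows = conn.execute("""
    SELECT u.username, q.resolved_url
    FROM url_resolution_queue AS q
    JOIN twitter_user AS u
      ON CAST(q.twitter_id AS INTEGER) = u.id  -- explicit cast on the text side
    WHERE q.resolved_url LIKE 'https://t.me/%'
""").fetchall()

print(rows)  # [('Block_Flash', 'https://t.me/bull_flash')]
```
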
### Current Statistics (as of 2025-11-04)
**Candidates**:
- Users with candidates: 38,121
- Total candidates found: 253,117
- Processed by LLM: 253,085
- Pending verification: 32
**Verified Matches**:
- Users with matches: 25,662
- Total matches: 36,147
- Average confidence: 0.74
- High confidence (90%+): 12,031
- Medium confidence (80-89%): 1,452
- Low confidence (70-79%): 7,505
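
The confidence tiers above can be expressed as a small bucketing function. This is a sketch under the assumption that the tier boundaries are inclusive at the lower end; the function name and the `below_threshold` label are hypothetical.

```python
from collections import Counter


def confidence_bucket(confidence: float) -> str:
    """Map a match confidence in [0, 1] to the tiers used in the statistics."""
    if confidence >= 0.90:
        return "high"
    if confidence >= 0.80:
        return "medium"
    if confidence >= 0.70:
        return "low"
    return "below_threshold"  # hypothetical label for everything under 70%


# Example: summarize a handful of made-up confidence scores.
sample_scores = [0.95, 0.91, 0.84, 0.72, 0.55]
summary = Counter(confidence_bucket(score) for score in sample_scores)
print(dict(summary))
```
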
### LLM Verification (V6 Prompt)
Current prompt improvements:
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy (Very Strong, Strong Supporting, Weak, Red Flags)
- Company vs personal account differentiation
- Shortened from ~135 to ~90 lines while improving clarity
- Emphasis on evaluating ALL candidates together
### Performance Metrics
**Candidate Finding (Threaded)**:
- Speed: ~1.5 contacts/sec
- Time for 43K contacts: ~16-18 hours
- Workers: 8 (default)
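
A worker pool of this kind can be sketched with the standard library. This is not the code in `find_twitter_candidates_threaded.py`: `run_threaded` is a hypothetical helper, and the body of `find_candidates_for_contact` is a placeholder standing in for the real matching logic (which would presumably also need its own database connection per thread).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, Iterable, List


def find_candidates_for_contact(contact_id: int) -> List[dict]:
    # Placeholder: the real function runs the matching methods for one contact.
    return [{"contact_id": contact_id, "method": "exact_username", "confidence": 0.90}]


def run_threaded(contact_ids: Iterable[int], workers: int = 8) -> Dict[int, List[dict]]:
    """Process contacts concurrently and collect candidates per contact."""
    results: Dict[int, List[dict]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(find_candidates_for_contact, cid): cid for cid in contact_ids}
        for future in as_completed(futures):
            cid = futures[future]
            results[cid] = future.result()  # re-raises any worker exception
    return results


results = run_threaded(range(5))
print(len(results))
```
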
**LLM Verification (Async)**:
- Speed: ~32 users/minute (100 concurrent requests)
- Cost: ~$0.003 per user (GPT-5-mini)
- Time for 43K users: ~23 hours
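
The figures in this section can be cross-checked with simple arithmetic. The verification numbers line up: ~32 users/minute gives roughly 22 hours for 43K users, and ~$0.003 per user gives roughly $129. The candidate-finding numbers do not multiply out the same way: a sustained 1.5 contacts/sec would finish 43K contacts in about 8 hours, so the reported 16-18 hours corresponds to roughly 0.66-0.75 contacts/sec end to end. One possible reading, which is an assumption, is that ~1.5 contacts/sec is a peak rate rather than a sustained average.

```python
CONTACTS = 43_000


def hours_at(rate_per_second: float, total: int = CONTACTS) -> float:
    """Wall-clock hours needed to process `total` items at a steady rate."""
    return total / rate_per_second / 3600


finding_hours = hours_at(1.5)        # nominal candidate-finding rate
verify_hours = hours_at(32 / 60)     # 32 users/minute expressed per second
verify_cost = CONTACTS * 0.003       # dollars at ~$0.003 per user

# Effective rates implied by the reported 16-18 hour runtime.
implied_rate_slow = CONTACTS / (18 * 3600)
implied_rate_fast = CONTACTS / (16 * 3600)

print(f"finding at 1.5/sec: {finding_hours:.1f} h")
print(f"verification: {verify_hours:.1f} h, ${verify_cost:.0f}")
print(f"implied finding rate: {implied_rate_slow:.2f}-{implied_rate_fast:.2f} contacts/sec")
```
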
### Module Structure
```
ProfileMatching/
├── main.py # Interactive menu system
├── README.md # Complete documentation
├── CHANGELOG.md # This file
├── find_twitter_candidates.py # Core matching logic (10 methods)
├── find_twitter_candidates_threaded.py # Threaded implementation
├── verify_twitter_matches_v2.py # LLM verification (V6 prompt)
├── review_match_quality.py # Analysis tools
└── setup_twitter_matching_schema.sql # Database schema
```
### 10 Matching Methods Summary
1. **Phash Match** (0.95/0.88) - Profile picture similarity
2. **Exact Bio Handle** (0.95) - Twitter handle extracted from Telegram bio
3. **Bio URL Resolution** (0.95) ⭐ NEW - Shortened URL resolves to Telegram
4. **Twitter Bio Has Telegram** (0.92) - Twitter bio mentions Telegram username
5. **Display Name Containment** (0.92) - Telegram display name contained in Twitter display name
6. **Exact Username** (0.90) - Usernames match exactly
7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85) - Trigram similarity
10. **Username Variation** (0.80) - Generated username variations
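
For readers unfamiliar with trigram similarity (method 9), here is a minimal pure-Python sketch in the style of PostgreSQL's `pg_trgm` extension: each word is padded with two leading spaces and one trailing space, split into three-character windows, and the score is the shared trigrams divided by the total distinct trigrams. Whether the module uses `pg_trgm` itself is an assumption, and how the similarity is mapped onto the 0.65-0.85 confidence range is not shown here.

```python
import re
from typing import Set


def trigrams(text: str) -> Set[str]:
    """Collect padded character trigrams for each alphanumeric word."""
    grams: Set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):  # ASCII-only for simplicity
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(round(trigram_similarity("word", "two words"), 4))  # 0.3636
```
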
### Testing
All changes were tested with:
- Standalone method testing (`find_by_resolved_url`)
- Full integration testing (`find_candidates_for_contact`)
- A deduplication check confirming it works correctly
- Confirmation that matches with different usernames are captured
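
One way to picture the deduplication being tested: when several methods propose the same Twitter account for a contact, keep a single entry with the highest confidence. This is a sketch under that assumption, not the module's actual rule; the dictionary keys and the `fuzzy_name` method label are hypothetical.

```python
from typing import Dict, List


def deduplicate(candidates: List[dict]) -> List[dict]:
    """Keep the highest-confidence entry per twitter_id, best first."""
    best: Dict[str, dict] = {}
    for cand in candidates:
        key = cand["twitter_id"]
        if key not in best or cand["confidence"] > best[key]["confidence"]:
            best[key] = cand
    return sorted(best.values(), key=lambda c: c["confidence"], reverse=True)


raw = [
    {"twitter_id": "63429948", "method": "fuzzy_name", "confidence": 0.70},
    {"twitter_id": "63429948", "method": "twitter_bio_url_resolves_to_telegram", "confidence": 0.95},
    {"twitter_id": "11111111", "method": "fuzzy_name", "confidence": 0.80},
]
unique = deduplicate(raw)
print([(c["twitter_id"], c["method"]) for c in unique])
```
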
### Next Steps
Potential future enhancements:
- Add more matching methods (location, bio keywords, mutual connections)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats
- Additional URL resolution sources beyond Twitter bios
### Migration Notes
**For existing deployments**:
1. No database schema changes required
2. Existing `url_resolution_queue` table is used as-is
3. Scripts in `scripts/` folder remain unchanged and functional
4. The new ProfileMatching module is additive and does not break existing workflows
**To use the new features** (choose one):
1. Use `ProfileMatching/main.py` instead of the individual scripts
2. Run the scripts directly from the `ProfileMatching/` folder
3. Update import paths to use the ProfileMatching module