mirror of
https://github.com/lockin-bot/ProfileMatching.git
synced 2026-01-12 09:44:30 +08:00
This module provides comprehensive Twitter-to-Telegram profile matching and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

# ProfileMatching Changelog

## 2025-11-04 - Initial Module Creation & URL Resolution Integration

### Module Organization

- Created dedicated `ProfileMatching/` folder within UnifiedContacts
- Bundled all matching and verification scripts together
- Added comprehensive interactive `main.py` with menu system
- Added detailed `README.md` documentation

### Major Enhancement: URL Resolution Integration

**Problem**: Matches were being missed when a Twitter username differs from the Telegram username, even when the Twitter profile explicitly links to Telegram.

**Solution**: Integrated `url_resolution_queue` table data into candidate finding.

**Implementation**:
- Added new method `find_by_resolved_url()` to the `TwitterMatcher` class (find_twitter_candidates.py:339-378)
- Queries Twitter profiles whose bio URLs (t.co/xyz) resolve to t.me/{username}
- Integrated as Method 5b in the candidate finding pipeline (find_twitter_candidates.py:554-574)
- Baseline confidence: 0.95 (very high, because the bio contains an explicit link)

**Impact**:
- 140+ potential new matches identified
- Captures matches like Twitter @Block_Flash → Telegram @bull_flash
- Especially valuable when usernames differ but the user explicitly links the profiles

**Example Match**:
```
Twitter: @Block_Flash (ID: 63429948)
Telegram: @bull_flash
Method: twitter_bio_url_resolves_to_telegram
Confidence: 0.95
Resolved URL: https://t.me/bull_flash
Original URL: https://t.co/dc3iztSG9B
```
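
The URL-resolution match described above can be sketched roughly as follows. This is a minimal illustration, not the actual `find_by_resolved_url()` code: the helper names `telegram_handle_from_url` and `build_candidate`, the username pattern, and the list of reserved path segments are all assumptions introduced for the example. Only the method label and the 0.95 baseline confidence come from this changelog.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumption: Telegram usernames start with a letter and are 5-32 characters
# of letters, digits, and underscores.
_HANDLE_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]{4,31}$")

# Assumption: path segments that are t.me features rather than usernames.
_RESERVED = {"joinchat", "addstickers", "share"}


def telegram_handle_from_url(resolved_url: str) -> Optional[str]:
    """Return the lowercase Telegram handle in a t.me URL, or None."""
    parsed = urlparse(resolved_url)
    if (parsed.hostname or "").lower() != "t.me":
        return None
    first_segment = parsed.path.strip("/").split("/")[0]
    if first_segment.lower() in _RESERVED:
        return None
    if not _HANDLE_RE.match(first_segment):
        return None
    return first_segment.lower()


def build_candidate(twitter_username, twitter_id, original_url,
                    resolved_url, telegram_username):
    """Build a candidate record if the resolved URL points at the contact."""
    handle = telegram_handle_from_url(resolved_url)
    if handle is None or handle != telegram_username.lstrip("@").lower():
        return None
    return {
        "twitter_username": twitter_username,
        "twitter_id": twitter_id,
        "method": "twitter_bio_url_resolves_to_telegram",
        "confidence": 0.95,  # baseline confidence stated in this changelog
        "resolved_url": resolved_url,
        "original_url": original_url,
    }


candidate = build_candidate(
    "Block_Flash", 63429948,
    "https://t.co/dc3iztSG9B", "https://t.me/bull_flash", "@bull_flash",
)
print(candidate)
```

Comparing handles in lowercase is a design choice of the sketch; it keeps a link such as `t.me/Bull_Flash` from being missed on case alone.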

### URL Resolution Data Source

The `url_resolution_queue` table contains:
- 16,133 Twitter profiles with resolved Telegram URLs
- Shortened URLs from Twitter bios (t.co/xyz)
- Resolved destinations (full URLs)
- Extracted Telegram handles (JSONB array)

### Files Modified

1. **find_twitter_candidates.py**
   - Added `find_by_resolved_url()` method
   - Integrated into `find_candidates_for_contact()` as Method 5b
   - Fixed type casting in the join between `twitter_id` (VARCHAR) and `user.id` (BIGINT)

2. **main.py** (ProfileMatching module)
   - Created comprehensive interactive menu
   - Real-time statistics display
   - Streamlined workflow for candidate finding and LLM verification

3. **README.md**
   - Complete documentation of all 10 matching methods
   - Usage examples and performance metrics
   - Configuration and troubleshooting guides

4. **UnifiedContacts/main.py**
   - Added option 15: "Open Profile Matching System (Interactive Menu)"
   - Reorganized menu to separate ProfileMatching from individual Twitter steps
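
The type-casting fix listed for `find_twitter_candidates.py` concerns joining a text-typed `twitter_id` to a numeric user ID. The sketch below shows the general idea with Python's built-in `sqlite3` as a stand-in database. The table name `twitter_users` and the `resolved_url` column are hypothetical, and the project's real query is not reproduced here. SQLite is more lenient about mixed-type comparisons than stricter databases, so the explicit `CAST` is there to show the pattern rather than because SQLite requires it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- twitter_id is stored as text, mirroring the VARCHAR column in the changelog
    CREATE TABLE url_resolution_queue (twitter_id TEXT, resolved_url TEXT);
    -- hypothetical user table with a numeric primary key
    CREATE TABLE twitter_users (id INTEGER PRIMARY KEY, username TEXT);

    INSERT INTO url_resolution_queue VALUES ('63429948', 'https://t.me/bull_flash');
    INSERT INTO twitter_users VALUES (63429948, 'Block_Flash');
    """
)

# Cast the text ID to an integer so both sides of the join share a type.
rows = conn.execute(
    """
    SELECT u.username, q.resolved_url
    FROM url_resolution_queue AS q
    JOIN twitter_users AS u
      ON CAST(q.twitter_id AS INTEGER) = u.id
    """
).fetchall()

print(rows)
conn.close()
```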

### Current Statistics (as of 2025-11-04)

**Candidates**:
- Users with candidates: 38,121
- Total candidates found: 253,117
- Processed by LLM: 253,085
- Pending verification: 32

**Verified Matches**:
- Users with matches: 25,662
- Total matches: 36,147
- Average confidence: 0.74
- High confidence (90%+): 12,031
- Medium confidence (80-89%): 1,452
- Low confidence (70-79%): 7,505
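
A few ratios can be derived from the figures above. The snippet is plain arithmetic on the listed numbers. The reading of the three confidence buckets as counts of matches (rather than users) is an assumption, and the variable names are illustrative.

```python
# Figures copied from the statistics above.
users_with_candidates = 38_121
total_candidates = 253_117
processed_by_llm = 253_085
users_with_matches = 25_662
total_matches = 36_147

candidates_per_user = total_candidates / users_with_candidates
pending = total_candidates - processed_by_llm
user_match_rate = users_with_matches / users_with_candidates
matches_per_matched_user = total_matches / users_with_matches

# Assumption: the three buckets count matches, not users.
bucketed = 12_031 + 1_452 + 7_505
outside_listed_buckets = total_matches - bucketed

print(f"candidates per user:        {candidates_per_user:.2f}")
print(f"pending verification:       {pending}")
print(f"users with a match:         {user_match_rate:.1%}")
print(f"matches per matched user:   {matches_per_matched_user:.2f}")
print(f"matches outside 70%+ bands: {outside_listed_buckets}")
```

Under that assumption, the three listed bands do not add up to the total, so the remaining matches would sit below the 70% band.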

### LLM Verification (V6 Prompt)

Current prompt improvements:
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy (Very Strong, Strong Supporting, Weak, Red Flags)
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while improving clarity
- Emphasis on evaluating ALL candidates together

### Performance Metrics

**Candidate Finding (Threaded)**:
- Speed: ~1.5 contacts/sec
- Time for 43K contacts: ~16-18 hours
- Workers: 8 (default)

**LLM Verification (Async)**:
- Speed: ~32 users/minute (100 concurrent requests)
- Cost: ~$0.003 per user (GPT-5-mini)
- Time for 43K users: ~23 hours
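
The run-time and cost figures can be cross-checked with simple arithmetic. The sketch treats "43K" as exactly 43,000, which is an assumption, and uses only the rates listed above.

```python
contacts = 43_000  # assumption: "43K" taken as exactly 43,000

# Candidate finding at the stated rate of ~1.5 contacts/sec.
finder_hours_at_stated_rate = contacts / 1.5 / 3600

# Rate implied by the stated 16-18 hour run time.
implied_rate_16h = contacts / (16 * 3600)
implied_rate_18h = contacts / (18 * 3600)

# LLM verification at ~32 users/minute and ~$0.003 per user.
verify_hours = contacts / 32 / 60
total_cost = contacts * 0.003

print(f"finding at 1.5/sec:  {finder_hours_at_stated_rate:.1f} h")
print(f"implied rate, 16 h:  {implied_rate_16h:.2f} contacts/sec")
print(f"implied rate, 18 h:  {implied_rate_18h:.2f} contacts/sec")
print(f"verification time:   {verify_hours:.1f} h")
print(f"verification cost:   ${total_cost:.0f}")
```

The verification figures are consistent with each other (roughly 22 hours and about $129). For candidate finding, a sustained 1.5 contacts/sec would finish 43,000 contacts in about 8 hours, so the 16-18 hour figure suggests the average rate over a full run is closer to 0.7 contacts/sec. One possible reading is that ~1.5 contacts/sec is a peak rate, but the changelog does not say.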

### Module Structure

```
ProfileMatching/
├── main.py                              # Interactive menu system
├── README.md                            # Complete documentation
├── CHANGELOG.md                         # This file
├── find_twitter_candidates.py           # Core matching logic (10 methods)
├── find_twitter_candidates_threaded.py  # Threaded implementation
├── verify_twitter_matches_v2.py         # LLM verification (V6 prompt)
├── review_match_quality.py              # Analysis tools
└── setup_twitter_matching_schema.sql    # Database schema
```

### 10 Matching Methods Summary

1. **Phash Match** (0.95/0.88) - Profile picture similarity
2. **Exact Bio Handle** (0.95) - Twitter handle extracted from Telegram bio
3. **Bio URL Resolution** (0.95) ⭐ NEW - Shortened URL resolves to Telegram
4. **Twitter Bio Has Telegram** (0.92) - Twitter bio mentions Telegram username
5. **Display Name Containment** (0.92) - Telegram name contained in Twitter name
6. **Exact Username** (0.90) - Usernames match exactly
7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85) - Trigram similarity
10. **Username Variation** (0.80) - Generated username variations
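
To illustrate the trigram similarity behind the Fuzzy Name method, here is a small pure-Python sketch. It is modelled loosely on how PostgreSQL's `pg_trgm` extension builds trigrams (lowercasing, padding each word with two leading spaces and one trailing space, and dividing shared trigrams by the union). Whether the module actually uses `pg_trgm`, and how a similarity score maps onto the 0.65-0.85 confidence range, is not stated in this changelog, so both are left out.

```python
import re


def trigrams(text: str) -> set:
    """Return the set of padded character trigrams for each word in text."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(trigram_similarity("Bull Flash", "bull flash"))
print(trigram_similarity("Bull Flash", "Block Flash"))
```

Identical names (ignoring case) score 1.0, while names that share only some words score somewhere in between.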

### Testing

All changes tested with:
- Standalone method testing (`find_by_resolved_url`)
- Full integration testing (`find_candidates_for_contact`)
- Verified deduplication works correctly
- Confirmed matches with different usernames are captured
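
Since several methods can surface the same Twitter account for one contact, some form of deduplication is needed. The changelog does not describe the rule used, so the sketch below assumes one plausible policy: keep the highest-confidence record per Twitter ID. The function name `deduplicate_candidates` and the record layout are hypothetical.

```python
def deduplicate_candidates(candidates):
    """Keep one record per twitter_id, preferring the highest confidence.

    Assumed policy for illustration; the module's real rule may differ.
    """
    best = {}
    for cand in candidates:
        key = cand["twitter_id"]
        if key not in best or cand["confidence"] > best[key]["confidence"]:
            best[key] = cand
    # Strongest candidates first.
    return sorted(best.values(), key=lambda c: c["confidence"], reverse=True)


found = [
    {"twitter_id": 63429948, "method": "fuzzy_name", "confidence": 0.70},
    {"twitter_id": 63429948,
     "method": "twitter_bio_url_resolves_to_telegram", "confidence": 0.95},
    {"twitter_id": 111, "method": "exact_username", "confidence": 0.90},
]

unique = deduplicate_candidates(found)
for cand in unique:
    print(cand["twitter_id"], cand["method"], cand["confidence"])
```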

### Next Steps

Potential future enhancements:
- Add more matching methods (location, bio keywords, mutual connections)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats
- Additional URL resolution sources beyond Twitter bios

### Migration Notes

**For existing deployments**:
1. No database schema changes required
2. Existing `url_resolution_queue` table is used as-is
3. Scripts in the `scripts/` folder remain unchanged and functional
4. The new ProfileMatching module is additive and does not break existing workflows

**To use new features**:
1. Use `ProfileMatching/main.py` instead of individual scripts
2. Or run scripts directly from the ProfileMatching folder
3. Or update import paths to use the ProfileMatching module