mirror of
https://github.com/lockin-bot/ProfileMatching.git
synced 2026-01-12 09:44:30 +08:00
This module provides comprehensive Twitter-to-Telegram profile matching and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
5.6 KiB
ProfileMatching Changelog
2025-11-04 - Initial Module Creation & URL Resolution Integration
Module Organization
- Created dedicated ProfileMatching/ folder within UnifiedContacts
- Bundled all matching and verification scripts together
- Added comprehensive interactive main.py with menu system
- Added detailed README.md documentation
Major Enhancement: URL Resolution Integration
Problem: Matches were missed when a Twitter username differed from the Telegram username, even when the Twitter profile explicitly linked to Telegram.
Solution: Integrated url_resolution_queue table data into candidate finding.
Implementation:
- Added new method find_by_resolved_url() to TwitterMatcher class (find_twitter_candidates.py:339-378)
- Queries Twitter profiles where bio URLs (t.co/xyz) resolve to t.me/{username}
- Integrated as Method 5b in candidate finding pipeline (find_twitter_candidates.py:554-574)
- Baseline confidence: 0.95 (very high - explicit link in bio)
Impact:
- 140+ potential new matches identified
- Captures matches like Twitter @Block_Flash → Telegram @bull_flash
- Especially valuable when usernames differ but user explicitly links profiles
Example Match:
Twitter: @Block_Flash (ID: 63429948)
Telegram: @bull_flash
Method: twitter_bio_url_resolves_to_telegram
Confidence: 0.95
Resolved URL: https://t.me/bull_flash
Original URL: https://t.co/dc3iztSG9B
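The step from a resolved URL to a candidate can be sketched as follows. This is a minimal illustration, not the code in find_by_resolved_url(): the helper names, the handle pattern, and the list of reserved t.me paths are assumptions made for the example. Only the method name, the 0.95 confidence, and the sample URLs come from the entry above.

```python
import re
from urllib.parse import urlparse

# Assumption: handles are 5-32 characters of letters, digits, and underscores.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]{5,32}")
# Assumption: t.me paths that are not user handles (illustrative, not exhaustive).
RESERVED_PATHS = {"joinchat", "addstickers", "share"}
TELEGRAM_HOSTS = {"t.me", "telegram.me"}


def extract_telegram_handle(resolved_url):
    """Return the lowercased handle from a Telegram URL, or None if there is none."""
    parsed = urlparse(resolved_url)
    if (parsed.hostname or "").lower() not in TELEGRAM_HOSTS:
        return None
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        return None
    first = segments[0]
    if first.lower() in RESERVED_PATHS or not HANDLE_RE.fullmatch(first):
        return None
    return first.lower()


def candidate_from_resolved_url(twitter_username, original_url, resolved_url):
    """Build a candidate record (hypothetical shape) when the URL points at a handle."""
    handle = extract_telegram_handle(resolved_url)
    if handle is None:
        return None
    return {
        "twitter_username": twitter_username,
        "telegram_username": handle,
        "method": "twitter_bio_url_resolves_to_telegram",
        "confidence": 0.95,
        "original_url": original_url,
        "resolved_url": resolved_url,
    }


match = candidate_from_resolved_url(
    "Block_Flash", "https://t.co/dc3iztSG9B", "https://t.me/bull_flash"
)
```

Invite links and non-Telegram hosts return None, so they never produce a candidate in this sketch.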
URL Resolution Data Source
The url_resolution_queue table contains:
- 16,133 Twitter profiles with resolved Telegram URLs
- Shortened URLs from Twitter bios (t.co/xyz)
- Resolved destinations (full URLs)
- Extracted Telegram handles (JSONB array)
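A rough sketch of how such a table might be queried is shown below. It uses an in-memory SQLite database purely so the example is self-contained; the real table holds a JSONB column, which implies PostgreSQL. Every column name and the twitter_user table name are assumptions for illustration. The explicit cast reflects the VARCHAR twitter_id versus BIGINT id join mentioned under Files Modified.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: column and table names are assumptions.
    CREATE TABLE twitter_user (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE url_resolution_queue (
        twitter_id TEXT,         -- stored as text, unlike twitter_user.id
        original_url TEXT,       -- shortened URL from the Twitter bio
        resolved_url TEXT,       -- full destination URL
        telegram_handles TEXT    -- JSON array of extracted handles
    );
    """
)
conn.execute("INSERT INTO twitter_user VALUES (?, ?)", (63429948, "Block_Flash"))
conn.execute(
    "INSERT INTO url_resolution_queue VALUES (?, ?, ?, ?)",
    (
        "63429948",
        "https://t.co/dc3iztSG9B",
        "https://t.me/bull_flash",
        json.dumps(["bull_flash"]),
    ),
)


def twitter_profiles_linking_to(connection, telegram_username):
    """Return (twitter_id, twitter_username, resolved_url) rows for a Telegram handle."""
    rows = connection.execute(
        """
        SELECT u.id, u.username, q.resolved_url, q.telegram_handles
        FROM url_resolution_queue AS q
        JOIN twitter_user AS u ON CAST(q.twitter_id AS INTEGER) = u.id
        """
    ).fetchall()
    wanted = telegram_username.lower()
    return [
        (user_id, username, url)
        for user_id, username, url, handles in rows
        if wanted in (handle.lower() for handle in json.loads(handles))
    ]


profiles = twitter_profiles_linking_to(conn, "Bull_Flash")
```

Filtering the handle array in Python keeps the sketch portable; a PostgreSQL query would more likely filter on the JSONB column directly.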
Files Modified
- find_twitter_candidates.py
  - Added find_by_resolved_url() method
  - Integrated into find_candidates_for_contact() as Method 5b
  - Fixed type casting for twitter_id (VARCHAR) to user.id (BIGINT) join
- main.py (ProfileMatching module)
  - Created comprehensive interactive menu
  - Real-time statistics display
  - Streamlined workflow for candidate finding and LLM verification
- README.md
  - Complete documentation of all 10 matching methods
  - Usage examples and performance metrics
  - Configuration and troubleshooting guides
- UnifiedContacts/main.py
  - Added option 15: "Open Profile Matching System (Interactive Menu)"
  - Reorganized menu to separate ProfileMatching from individual Twitter steps
Current Statistics (as of 2025-11-04)
Candidates:
- Users with candidates: 38,121
- Total candidates found: 253,117
- Processed by LLM: 253,085
- Pending verification: 32
Verified Matches:
- Users with matches: 25,662
- Total matches: 36,147
- Average confidence: 0.74
- High confidence (90%+): 12,031
- Medium confidence (80-89%): 1,452
- Low confidence (70-79%): 7,505
LLM Verification (V6 Prompt)
Current prompt improvements:
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy (Very Strong, Strong Supporting, Weak, Red Flags)
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer
- Emphasis on evaluating ALL candidates together
Performance Metrics
Candidate Finding (Threaded):
- Speed: ~1.5 contacts/sec
- Time for 43K contacts: ~16-18 hours
- Workers: 8 (default)
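A worker pool of this kind can be sketched with the standard library. The body of find_candidates_for_contact() here is a stand-in that returns a dummy record, and the contact and result shapes are assumptions; only the function name and the default of 8 workers come from this changelog.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

DEFAULT_WORKERS = 8  # default worker count stated above


def find_candidates_for_contact(contact):
    """Stand-in for the real matcher: returns one dummy candidate per contact."""
    return [
        {
            "contact_id": contact["id"],
            "twitter_username": contact["username"],
            "confidence": 0.90,
        }
    ]


def find_candidates_threaded(contacts, workers=DEFAULT_WORKERS):
    """Run the matcher across a thread pool and collect results by contact id."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(find_candidates_for_contact, contact): contact["id"]
            for contact in contacts
        }
        # Results arrive in completion order, so they are keyed by contact id.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


contacts = [{"id": i, "username": f"user{i}"} for i in range(20)]
results = find_candidates_threaded(contacts)
```

Threads suit this workload only if the per-contact work is dominated by database or network waits, which is an assumption about the real matcher.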
LLM Verification (Async):
- Speed: ~32 users/minute (100 concurrent requests)
- Cost: ~$0.003 per user (GPT-5-mini)
- Time for 43K users: ~23 hours
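The figures above can be cross-checked with simple arithmetic. The sketch below assumes exactly 43,000 items and constant throughput. Under those assumptions the verification numbers agree with each other (about 22 hours and about $129), while 1.5 contacts/sec alone would give roughly 8 hours for candidate finding. The stated 16-18 hours corresponds to an effective rate of roughly 0.66-0.75 contacts/sec, and which figure reflects sustained throughput is not stated here.

```python
ITEMS = 43_000  # assumption: "43K" taken as exactly 43,000


def hours_at_rate(items, per_second):
    """Hours needed to process `items` at a constant per-second rate."""
    return items / per_second / 3600


# Candidate finding at the stated ~1.5 contacts/sec.
finding_hours = hours_at_rate(ITEMS, 1.5)

# LLM verification at the stated ~32 users/minute.
verify_hours = hours_at_rate(ITEMS, 32 / 60)

# Verification cost at the stated ~$0.003 per user.
verify_cost = ITEMS * 0.003

# Effective contacts/sec implied by the stated 16-18 hour range.
implied_rate_fast = ITEMS / (16 * 3600)
implied_rate_slow = ITEMS / (18 * 3600)

print(f"finding at 1.5/sec: {finding_hours:.1f} h")
print(f"verification: {verify_hours:.1f} h, ${verify_cost:.0f}")
print(f"implied finding rate: {implied_rate_slow:.2f}-{implied_rate_fast:.2f} contacts/sec")
```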
Module Structure
ProfileMatching/
├── main.py # Interactive menu system
├── README.md # Complete documentation
├── CHANGELOG.md # This file
├── find_twitter_candidates.py # Core matching logic (10 methods)
├── find_twitter_candidates_threaded.py # Threaded implementation
├── verify_twitter_matches_v2.py # LLM verification (V6 prompt)
├── review_match_quality.py # Analysis tools
└── setup_twitter_matching_schema.sql # Database schema
10 Matching Methods Summary
- Phash Match (0.95/0.88) - Profile picture similarity
- Exact Bio Handle (0.95) - Twitter handle extracted from Telegram bio
- Bio URL Resolution (0.95) ⭐ NEW - Shortened URL resolves to Telegram
- Twitter Bio Has Telegram (0.92) - Twitter bio mentions Telegram username
- Display Name Containment (0.92) - Telegram name contained in Twitter name
- Exact Username (0.90) - Usernames match exactly
- TG Username in Twitter Name (0.88)
- Twitter Username in TG Name (0.86)
- Fuzzy Name (0.65-0.85) - Trigram similarity
- Username Variation (0.80) - Generated username variations
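For the Fuzzy Name method, trigram similarity can be illustrated in plain Python. The sketch follows the general scheme used by PostgreSQL's pg_trgm extension (lowercase, pad each word, compare trigram sets); whether the module relies on pg_trgm is an assumption, and the mapping from similarity to the 0.65-0.85 confidence range is not covered here.

```python
import re


def trigrams(text):
    """Return the set of padded three-character sequences for each word in text."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(left, right):
    """Shared trigrams divided by all distinct trigrams, in the range 0.0 to 1.0."""
    left_grams, right_grams = trigrams(left), trigrams(right)
    if not left_grams or not right_grams:
        return 0.0
    return len(left_grams & right_grams) / len(left_grams | right_grams)


score = trigram_similarity("Jon Smith", "John Smith")
```

Identical names score 1.0 and names with no shared trigrams score 0.0; near-misses such as a one-letter difference fall in between.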
Testing
All changes tested with:
- Standalone method testing (find_by_resolved_url)
- Full integration testing (find_candidates_for_contact)
- Verified deduplication works correctly
- Confirmed matches with different usernames are captured
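Deduplication matters because several methods can surface the same Twitter profile for one contact. One plausible rule, assumed here rather than taken from the module, is to keep the highest-confidence hit per Twitter ID. The record shape and the method names other than twitter_bio_url_resolves_to_telegram are hypothetical.

```python
def deduplicate_candidates(candidates):
    """Keep one record per twitter_id, preferring the highest confidence."""
    best = {}
    for candidate in candidates:
        key = candidate["twitter_id"]
        current = best.get(key)
        if current is None or candidate["confidence"] > current["confidence"]:
            best[key] = candidate
    # Strongest candidates first, which is convenient for later review.
    return sorted(best.values(), key=lambda c: c["confidence"], reverse=True)


raw = [
    {"twitter_id": 63429948, "method": "fuzzy_name", "confidence": 0.70},
    {"twitter_id": 111, "method": "exact_username", "confidence": 0.90},
    {
        "twitter_id": 63429948,
        "method": "twitter_bio_url_resolves_to_telegram",
        "confidence": 0.95,
    },
]
unique = deduplicate_candidates(raw)
```

Here the profile found by both fuzzy name and resolved URL is kept once, with the resolved-URL record winning.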
Next Steps
Potential future enhancements:
- Add more matching methods (location, bio keywords, mutual connections)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats
- Additional URL resolution sources beyond Twitter bios
Migration Notes
For existing deployments:
- No database schema changes required
- Existing url_resolution_queue table is used as-is
- Scripts in scripts/ folder remain unchanged and functional
- New ProfileMatching module is additive, doesn't break existing workflows
To use new features:
- Use ProfileMatching/main.py instead of individual scripts
- Or run scripts directly from ProfileMatching folder
- Or update import paths to use ProfileMatching module