mirror of
https://github.com/lockin-bot/ProfileMatching.git
synced 2026-01-12 09:44:30 +08:00
This module provides comprehensive Twitter-to-Telegram profile matching and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
208 lines
6.4 KiB
Markdown
# Twitter-Telegram Profile Matching System

A comprehensive system for finding and verifying Twitter-Telegram profile matches using multiple matching methods and LLM-based verification.

## Overview

This system operates in two main steps:

1. **Candidate Finding**: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods
2. **LLM Verification**: Uses GPT to evaluate candidates and assign confidence scores (0.70-1.0)

## Quick Start

```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```

## Matching Methods

The system uses 10 different methods to find Twitter candidates:

### High Confidence Methods (0.90-0.95)

1. **Phash Match** (0.95 for exact, 0.88 for distance=1)
   - Compares profile picture hashes
   - Pre-computed in the `telegram_twitter_phash_matches` table

2. **Exact Bio Handle** (0.95)
   - Extracts Twitter handles from the Telegram bio
   - Patterns: `@username`, `twitter.com/username`, `x.com/username`

3. **Bio URL Resolution** (0.95) ⭐ NEW
   - The Twitter bio contains a shortened URL (t.co/xyz) that resolves to `t.me/username`
   - Queries the `url_resolution_queue` table
   - Captures matches even when usernames differ

4. **Twitter Bio Has Telegram** (0.92)
   - Reverse lookup: the Twitter bio mentions the Telegram username
   - Searches for `@username`, `t.me/username`, `telegram.me/username`

5. **Display Name Containment** (0.92)
   - Telegram name is contained within the Twitter display name

6. **Exact Username** (0.90)
   - Telegram username exactly matches the Twitter username

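To illustrate the Exact Bio Handle method, the three patterns listed above could be pulled out of a bio with one regular expression. This is a minimal sketch and not the code in `find_twitter_candidates.py`: the function name `extract_twitter_handles`, the lowercasing, and the decision to ignore subdomains other than `www.` are assumptions made for the example.

```python
import re

# Hypothetical pattern. Group 1 captures "@username"; group 2 captures the
# first path segment of a twitter.com / x.com URL. Twitter handles are
# 1-15 characters from [A-Za-z0-9_].
_HANDLE_RE = re.compile(
    r"""
    (?<![A-Za-z0-9_@.])@([A-Za-z0-9_]{1,15})\b
    |
    (?<![A-Za-z0-9.])(?:www\.)?(?:twitter|x)\.com/@?([A-Za-z0-9_]{1,15})\b
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_twitter_handles(bio: str) -> list[str]:
    """Return unique, lowercased handles in the order they appear in the bio."""
    handles: list[str] = []
    for match in _HANDLE_RE.finditer(bio or ""):
        handle = (match.group(1) or match.group(2)).lower()
        if handle not in handles:
            handles.append(handle)
    return handles


print(extract_twitter_handles("DM @Alice_01 or https://x.com/bob, mail a@b.com"))
```

The lookbehinds keep e-mail addresses (`a@b.com`) and unrelated domains ending in `x.com` (such as `dropbox.com`) from producing false handles. Reserved paths like `twitter.com/home` are not filtered here and would need an explicit deny-list.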
### Medium Confidence Methods (0.65-0.88)

7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85)
   - PostgreSQL trigram similarity with a 0.65 threshold
10. **Username Variation** (0.80)
    - Generates variations (remove underscores, flip numbers, etc.)

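The Fuzzy Name method relies on PostgreSQL trigram similarity. To show what that score measures, here is a pure-Python approximation: each word is lowercased, padded with two leading spaces and one trailing space, split into three-character windows, and the two trigram sets are compared as intersection over union. It is a sketch for intuition only. The ASCII-only word splitting and the `>=` comparison against 0.65 are assumptions, and the database remains the source of truth for real scores.

```python
import re


def trigrams(text: str) -> set[str]:
    """Build the trigram set of a string, padding each word with spaces."""
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_fuzzy_match(a: str, b: str, threshold: float = 0.65) -> bool:
    """Assumed rule: names match when similarity reaches the threshold."""
    return similarity(a, b) >= threshold


print(round(similarity("word", "two words"), 4))
print(is_fuzzy_match("Andrew Jiang", "andrew jiang"))
```

Because the score depends on shared three-character windows, short names with one differing letter can fall below 0.65 while long names with a small typo stay above it.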
## LLM Verification

The system uses GPT-5-mini with a sophisticated V6 prompt that:

- Evaluates ALL candidates together (comparative evaluation)
- Applies differential scoring (only one can be "most likely")
- Distinguishes between personal and company accounts
- Considers signal strength holistically
- Only saves matches with 70%+ confidence

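The last rule above, saving only matches at 70% confidence or higher, can be pictured as a simple filter over the scores the LLM returns for one Telegram contact. The sketch below is illustrative: `ScoredCandidate` and `matches_to_save` are hypothetical names, and the real logic in `verify_twitter_matches_v2.py` may differ.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.70  # from this README: only 70%+ matches are saved


@dataclass
class ScoredCandidate:
    twitter_id: str
    confidence: float  # LLM-assigned score between 0.0 and 1.0


def matches_to_save(scored, threshold=CONFIDENCE_THRESHOLD):
    """Keep candidates at or above the threshold, strongest first."""
    kept = [c for c in scored if c.confidence >= threshold]
    return sorted(kept, key=lambda c: c.confidence, reverse=True)


scored = [
    ScoredCandidate("tw_a", 0.55),
    ScoredCandidate("tw_b", 0.92),
    ScoredCandidate("tw_c", 0.70),
]
print([c.twitter_id for c in matches_to_save(scored)])
```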
## Files

### Core Scripts

- `main.py` - Interactive menu for running the system
- `find_twitter_candidates.py` - Core matching logic (TwitterMatcher class)
- `find_twitter_candidates_threaded.py` - Threaded implementation (RECOMMENDED)
- `verify_twitter_matches_v2.py` - LLM verification with async (RECOMMENDED)
- `review_match_quality.py` - Analyze match quality and statistics

### Database Schema

- `setup_twitter_matching_schema.sql` - Database tables and indexes

## Database Tables

### `twitter_match_candidates`
Stores all potential matches found by the matching methods.

**Key fields:**
- `telegram_user_id` - Telegram contact user ID
- `twitter_id` - Twitter profile ID
- `match_method` - Which method found this candidate
- `baseline_confidence` - Initial confidence (0.0-1.0)
- `match_signals` - JSON with match details
- `llm_processed` - Whether LLM has evaluated this candidate

### `twitter_telegram_matches`
Stores verified matches (70%+ confidence from LLM).

**Key fields:**
- `telegram_user_id` - Telegram contact
- `twitter_id` - Matched Twitter profile
- `final_confidence` - LLM-assigned confidence (0.70-1.0)
- `llm_verdict` - LLM reasoning
- `match_method` - Original matching method
- `matched_at` - Timestamp

### `url_resolution_queue`
Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).

**Key fields:**
- `twitter_id` - Twitter profile ID
- `original_url` - Shortened URL (e.g., t.co/abc)
- `resolved_url` - Full URL (e.g., https://t.me/username)
- `telegram_handles` - Extracted Telegram handles (JSONB array)

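As a rough illustration of how a `resolved_url` could yield an entry for `telegram_handles`, the sketch below parses the URL and takes the first path segment when the host is `t.me` or `telegram.me`. The helper name, the reserved-path list, and the username rule (5-32 characters, starting with a letter) are assumptions for the example and are not taken from the project code.

```python
import re
from typing import Optional
from urllib.parse import urlparse

TELEGRAM_HOSTS = {"t.me", "telegram.me"}
# Assumed username shape: a letter followed by 4-31 letters, digits or "_".
_USERNAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{4,31}")
# Assumed non-profile paths that should not be treated as usernames.
_RESERVED = {"joinchat", "addstickers", "share"}


def telegram_handle_from_url(url: str) -> Optional[str]:
    """Return the lowercased Telegram username in a URL, or None."""
    if "://" not in url:
        url = "https://" + url  # let urlparse see a host for bare "t.me/..."
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in TELEGRAM_HOSTS:
        return None
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        return None
    first = segments[0]
    if first.lower() in _RESERVED or not _USERNAME_RE.fullmatch(first):
        return None
    return first.lower()


print(telegram_handle_from_url("https://t.me/some_user"))
```

Invite links such as `t.me/joinchat/...` or `t.me/+...` are rejected because they identify chats and carry no username.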
## Usage Examples

### Find Candidates for All Contacts (Threaded)
```bash
python3.10 find_twitter_candidates_threaded.py --workers 8
```

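For readers unfamiliar with the worker-pool pattern behind `--workers`, the following sketch shows one common way to spread per-contact work over a fixed number of threads using the standard library. It does not reproduce `find_twitter_candidates_threaded.py`: `find_candidates` is a stand-in for the real matching call, and `run_threaded` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def find_candidates(contact_id):
    """Placeholder for per-contact matching (database lookups in practice)."""
    return contact_id, [f"candidate_for_{contact_id}"]


def run_threaded(contact_ids, workers=8):
    """Process contacts on a pool of `workers` threads and collect results."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(find_candidates, cid) for cid in contact_ids]
        for future in as_completed(futures):
            contact_id, candidates = future.result()
            results[contact_id] = candidates
    return results


results = run_threaded(range(5), workers=2)
print(sorted(results))
```

Threads suit this kind of job when most of the time is spent waiting on the database. That is an assumption about the workload here, not a measured fact.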
### Find Candidates for First 1000 Contacts
```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```

### Verify Matches with LLM (100 concurrent requests)
```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```

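The `--concurrent` flag caps how many LLM requests are in flight at once. A typical way to get that behaviour with `asyncio` is a semaphore around each request, sketched below. `verify_user` is a placeholder that sleeps where a real API call would go, and `verify_all` is a hypothetical name. The actual script may be structured differently.

```python
import asyncio


async def verify_user(user_id):
    """Placeholder for one LLM verification request."""
    await asyncio.sleep(0.01)  # stands in for network latency
    return user_id, 0.9


async def verify_all(user_ids, concurrent=100):
    """Run all verifications with at most `concurrent` in flight."""
    semaphore = asyncio.Semaphore(concurrent)

    async def bounded(user_id):
        async with semaphore:
            return await verify_user(user_id)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(u) for u in user_ids))


results = asyncio.run(verify_all(range(20), concurrent=5))
print(len(results))
```

Raising the limit only helps until the provider's rate limit or the per-request latency becomes the bottleneck.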
### Test Mode (50 users, 10 concurrent)
```bash
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10
```

### Review Match Quality
```bash
python3.10 review_match_quality.py
```

## Performance

### Candidate Finding (Threaded)
- **Speed**: ~1.5 contacts/sec
- **Time for 43K contacts**: ~16-18 hours
- **Workers**: 8 (default, configurable)

### LLM Verification (Async)
- **Speed**: ~32 users/minute with 100 concurrent requests
- **Cost**: ~$0.003 per user (GPT-5-mini)
- **Time for 43K users**: ~23 hours

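The figures above can be cross-checked with a few lines of arithmetic, which is useful when estimating a run of a different size. The calculation below uses only the numbers quoted in this section and measures nothing itself. The LLM numbers line up (about 22.4 hours and $129 for 43K users). For candidate finding, a steady 1.5 contacts/sec would finish 43K contacts in roughly 8 hours, so the quoted 16-18 hours corresponds to a sustained rate closer to 0.66-0.75 contacts/sec.

```python
def hours_needed(items: int, rate_per_second: float) -> float:
    """Wall-clock hours to process `items` at a constant rate."""
    return items / rate_per_second / 3600


CONTACTS = 43_000

# Candidate finding at the quoted peak speed of 1.5 contacts/sec.
candidate_hours = hours_needed(CONTACTS, 1.5)  # about 8.0 hours

# Sustained rate implied by the quoted 16-18 hour total.
implied_rate_slow = CONTACTS / (18 * 3600)  # about 0.66 contacts/sec
implied_rate_fast = CONTACTS / (16 * 3600)  # about 0.75 contacts/sec

# LLM verification at 32 users/minute and $0.003 per user.
verify_hours = hours_needed(CONTACTS, 32 / 60)  # about 22.4 hours
verify_cost = CONTACTS * 0.003  # about 129 dollars

print(f"candidate finding at 1.5/s: {candidate_hours:.1f} h")
print(f"implied sustained rate: {implied_rate_slow:.2f}-{implied_rate_fast:.2f}/s")
print(f"verification: {verify_hours:.1f} h, ${verify_cost:.0f}")
```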
## Recent Improvements

### V6 Prompt (Latest)
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer

### URL Resolution Integration
- Added Method 5b: Bio URL resolution (listed as method 3 under Matching Methods)
- Captures 140+ additional matches
- Especially valuable when usernames differ
- 0.95 baseline confidence (very high)

## Configuration

Environment variables (in `/Users/andrewjiang/Bao/TimeToLockIn/Profile/.env`):
```
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini
```

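A script that depends on these two variables might read them as follows. This is a sketch under stated assumptions: it expects the variables to be present in the process environment already (how the project loads the `.env` file is not shown here), `load_openai_config` is a hypothetical helper, and falling back to `gpt-5-mini` when `OPENAI_MODEL` is unset is an illustrative choice.

```python
import os


def load_openai_config(env=None):
    """Read the OpenAI settings, failing early if the key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": api_key,
        # Assumed default, matching the model named in this README.
        "model": env.get("OPENAI_MODEL", "gpt-5-mini"),
    }


config = load_openai_config({"OPENAI_API_KEY": "your_key_here"})
print(config["model"])
```

Failing at startup when the key is absent avoids discovering the problem hours into a long verification run.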
Database connections:
- `telegram_contacts` - Telegram contact data
- `twitter_data` - Twitter profile data

## Tips

1. **Always run threaded candidate finding** - 10-20x faster than single-threaded
2. **Use high concurrency for LLM verification** - 100+ concurrent requests for optimal speed
3. **Monitor costs** - Check OpenAI usage during verification
4. **Review match quality periodically** - Use `review_match_quality.py` to analyze results
5. **Test first** - Use `--test --limit 50` flags before full runs

## Troubleshooting

### LLM verification is slow
- Increase `--concurrent` parameter (try 100-200)
- Check OpenAI rate limits (1,000 RPM for Tier 1)

### Many low-quality matches
- Review and adjust V6 prompt in `verify_twitter_matches_v2.py`
- Check `review_match_quality.py` for insights

### Missing obvious matches
- Check if candidate was found: Query `twitter_match_candidates`
- If not found, may need new matching method
- If found but not verified, check LLM reasoning in `llm_verdict`

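The first check for a missing match, querying `twitter_match_candidates` for one contact, might look like the sketch below. To keep it runnable without a database server it uses an in-memory SQLite table as a stand-in for the real PostgreSQL table. The column types, sample rows, and `match_method` values are invented for illustration, and only columns named in this README are used.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL table; types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE twitter_match_candidates (
        telegram_user_id    INTEGER,
        twitter_id          TEXT,
        match_method        TEXT,
        baseline_confidence REAL,
        llm_processed       INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO twitter_match_candidates VALUES (?, ?, ?, ?, ?)",
    [
        (1001, "tw_2", "fuzzy_name", 0.70, 0),        # hypothetical rows
        (1001, "tw_1", "exact_bio_handle", 0.95, 1),
        (2002, "tw_9", "exact_username", 0.90, 1),
    ],
)


def candidates_for(connection, telegram_user_id):
    """List candidates found for one Telegram contact, strongest first."""
    return connection.execute(
        """
        SELECT twitter_id, match_method, baseline_confidence, llm_processed
        FROM twitter_match_candidates
        WHERE telegram_user_id = ?
        ORDER BY baseline_confidence DESC
        """,
        (telegram_user_id,),
    ).fetchall()


for row in candidates_for(conn, 1001):
    print(row)
```

An empty result means no method found the profile. Rows with `llm_processed` still unset have not been evaluated yet, so there is no `llm_verdict` to read. With a PostgreSQL driver such as psycopg, the parameter placeholder is `%s` where SQLite uses `?`.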
## Future Enhancements

- Add more matching methods (location, bio keywords, etc.)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats