ProfileMatching/README.md
Andrew Jiang 5319d4d868 Initial commit: Twitter-Telegram Profile Matching System
This module provides comprehensive Twitter-to-Telegram profile matching
and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-04 22:56:25 -08:00


# Twitter-Telegram Profile Matching System
A comprehensive system for finding and verifying Twitter-Telegram profile matches using multiple matching methods and LLM-based verification.
## Overview
This system operates in two main steps:
1. **Candidate Finding**: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods
2. **LLM Verification**: Uses GPT to evaluate the candidates and assign confidence scores; only matches scoring 0.70 or higher are saved
## Quick Start
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```
## Matching Methods
The system uses 10 different methods to find Twitter candidates:
### High Confidence Methods (0.88-0.95)
1. **Phash Match** (0.95 for exact, 0.88 for distance=1)
   - Compares profile picture hashes
   - Pre-computed in `telegram_twitter_phash_matches` table
2. **Exact Bio Handle** (0.95)
   - Extracts Twitter handles from Telegram bio
   - Patterns: `@username`, `twitter.com/username`, `x.com/username`
3. **Bio URL Resolution** (0.95) ⭐ NEW
   - Twitter bio contains shortened URL (t.co/xyz) that resolves to `t.me/username`
   - Queries `url_resolution_queue` table
   - Captures matches even when usernames differ
4. **Twitter Bio Has Telegram** (0.92)
   - Reverse lookup: Twitter bio mentions Telegram username
   - Searches for `@username`, `t.me/username`, `telegram.me/username`
5. **Display Name Containment** (0.92)
   - Telegram name contained within Twitter display name
6. **Exact Username** (0.90)
   - Telegram username exactly matches Twitter username
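
The Exact Bio Handle patterns listed above can be sketched with two regular expressions. This is a minimal illustration, not the code in `find_twitter_candidates.py`: the function name `extract_twitter_handles` is hypothetical, and reserved paths such as `twitter.com/home` are not filtered out.

```python
import re

# Twitter/X usernames are 1-15 characters: letters, digits, underscore.
_HANDLE = r"([A-Za-z0-9_]{1,15})"

# Profile links: twitter.com/username or x.com/username.
# The lookbehind keeps hosts such as dropbox.com from matching on "x.com".
_URL_RE = re.compile(
    r"(?<![A-Za-z0-9.])(?:https?://)?(?:www\.)?(?:twitter|x)\.com/"
    + _HANDLE
    + r"(?![A-Za-z0-9_])",
    re.IGNORECASE,
)

# Bare mentions: @username. The lookbehind skips e-mail addresses.
_MENTION_RE = re.compile(r"(?<![A-Za-z0-9_.])@" + _HANDLE + r"(?![A-Za-z0-9_])")


def extract_twitter_handles(bio: str) -> list[str]:
    """Return lowercased, de-duplicated handles: link handles first, then @mentions."""
    found: list[str] = []
    for regex in (_URL_RE, _MENTION_RE):
        for match in regex.finditer(bio):
            handle = match.group(1).lower()
            if handle not in found:
                found.append(handle)
    return found
```

An `@username` in a Telegram bio may of course be a Telegram handle rather than a Twitter one, which is one reason every candidate still goes through LLM verification.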
### Medium Confidence Methods (0.65-0.88)
7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85)
   - PostgreSQL trigram similarity with 0.65 threshold
10. **Username Variation** (0.80)
    - Generates variations (remove underscores, flip numbers, etc.)
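
To make the Fuzzy Name threshold concrete, the sketch below approximates PostgreSQL's `pg_trgm` similarity in pure Python. It is an illustration only: the system runs the comparison inside PostgreSQL, the function names here are hypothetical, and this version handles ASCII letters and digits only.

```python
import re

FUZZY_THRESHOLD = 0.65  # threshold quoted for the Fuzzy Name method


def trigrams(text: str) -> set[str]:
    """Trigram set in the style of pg_trgm: lowercase, split on
    non-alphanumerics, pad each word with two leading and one trailing space."""
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_fuzzy_match(telegram_name: str, twitter_name: str) -> bool:
    """True when the two display names clear the fuzzy-name threshold."""
    return similarity(telegram_name, twitter_name) >= FUZZY_THRESHOLD
```

For example, `cat` and `cats` share three of six distinct trigrams, giving a similarity of 0.5, which would fall below the 0.65 threshold.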
## LLM Verification
The system uses GPT-5-mini with a sophisticated V6 prompt that:
- Evaluates ALL candidates together (comparative evaluation)
- Applies differential scoring (only one can be "most likely")
- Distinguishes between personal and company accounts
- Considers signal strength holistically
- Only saves matches with 70%+ confidence
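
The 70% cutoff can be pictured as a post-processing step over the scores the LLM returns for one Telegram user. The sketch below assumes those scores arrive as a mapping from Twitter ID to confidence; that shape and the name `matches_to_save` are illustrative, not taken from `verify_twitter_matches_v2.py`.

```python
SAVE_THRESHOLD = 0.70  # matches below this confidence are not saved


def matches_to_save(scores: dict[str, float],
                    threshold: float = SAVE_THRESHOLD) -> list[tuple[str, float]]:
    """Keep candidates at or above the threshold, strongest first.

    `scores` maps twitter_id -> LLM confidence for a single Telegram user
    (an assumed shape used only for this illustration).
    """
    kept = [(twitter_id, conf) for twitter_id, conf in scores.items() if conf >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Because the prompt scores all candidates together, at most one of them should come back with a top-tier confidence, so the first entry of the returned list is the "most likely" match.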
## Files
### Core Scripts
- `main.py` - Interactive menu for running the system
- `find_twitter_candidates.py` - Core matching logic (TwitterMatcher class)
- `find_twitter_candidates_threaded.py` - Threaded implementation (RECOMMENDED)
- `verify_twitter_matches_v2.py` - LLM verification with async (RECOMMENDED)
- `review_match_quality.py` - Analyze match quality and statistics
### Database Schema
- `setup_twitter_matching_schema.sql` - Database tables and indexes
## Database Tables
### `twitter_match_candidates`
Stores all potential matches found by the matching methods.
**Key fields:**
- `telegram_user_id` - Telegram contact user ID
- `twitter_id` - Twitter profile ID
- `match_method` - Which method found this candidate
- `baseline_confidence` - Initial confidence (0.0-1.0)
- `match_signals` - JSON with match details
- `llm_processed` - Whether LLM has evaluated this candidate
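
As an illustration of how these fields fit together, the sketch below builds a candidate record for the Phash Match method. It assumes "distance" means the Hamming distance between hex-encoded hashes; the Python types, the `"phash"` method label, and the `phash_distance` signal key are guesses for illustration, and the real definitions live in `setup_twitter_matching_schema.sql`.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MatchCandidate:
    """Mirror of the key fields of twitter_match_candidates (types assumed)."""
    telegram_user_id: int
    twitter_id: str
    match_method: str
    baseline_confidence: float
    match_signals: dict = field(default_factory=dict)
    llm_processed: bool = False


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two hex hashes of equal length."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def phash_candidate(telegram_user_id: int, twitter_id: str,
                    tg_hash: str, tw_hash: str) -> Optional[MatchCandidate]:
    """Return a candidate for distance 0 (0.95) or 1 (0.88), else None."""
    distance = hamming_distance(tg_hash, tw_hash)
    confidence = {0: 0.95, 1: 0.88}.get(distance)
    if confidence is None:
        return None
    return MatchCandidate(
        telegram_user_id=telegram_user_id,
        twitter_id=twitter_id,
        match_method="phash",  # assumed label
        baseline_confidence=confidence,
        match_signals={"phash_distance": distance},  # assumed key
    )
```

Serializing `match_signals` with `json.dumps` gives the JSON value stored alongside the candidate.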
### `twitter_telegram_matches`
Stores verified matches (70%+ confidence from LLM).
**Key fields:**
- `telegram_user_id` - Telegram contact
- `twitter_id` - Matched Twitter profile
- `final_confidence` - LLM-assigned confidence (0.70-1.0)
- `llm_verdict` - LLM reasoning
- `match_method` - Original matching method
- `matched_at` - Timestamp
### `url_resolution_queue`
Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).
**Key fields:**
- `twitter_id` - Twitter profile ID
- `original_url` - Shortened URL (e.g., t.co/abc)
- `resolved_url` - Full URL (e.g., https://t.me/username)
- `telegram_handles` - Extracted Telegram handles (JSONB array)
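
The step from `resolved_url` to an entry in `telegram_handles` might look like the sketch below. It is a hypothetical helper, not the project's resolver: it accepts `t.me` and `telegram.me` hosts and skips invite links, which do not name a user.

```python
from typing import Optional
from urllib.parse import urlparse

TELEGRAM_HOSTS = {"t.me", "telegram.me"}


def telegram_handle_from_url(resolved_url: str) -> Optional[str]:
    """Return the lowercased handle from a Telegram profile link, or None."""
    # urlparse only fills in the host when a scheme is present.
    url = resolved_url if "://" in resolved_url else "https://" + resolved_url
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in TELEGRAM_HOSTS:
        return None
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        return None
    first = segments[0]
    # Invite links (t.me/+code, t.me/joinchat/code) identify chats, not users.
    if first.startswith("+") or first.lower() == "joinchat":
        return None
    return first.lstrip("@").lower()
```

Lowercasing the handle makes the later comparison against Telegram usernames case-insensitive.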
## Usage Examples
### Find Candidates for All Contacts (Threaded)
```bash
python3.10 find_twitter_candidates_threaded.py --workers 8
```
### Find Candidates for First 1000 Contacts
```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```
### Verify Matches with LLM (100 concurrent requests)
```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```
### Test Mode (50 users, 10 concurrent)
```bash
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10
```
### Review Match Quality
```bash
python3.10 review_match_quality.py
```
## Performance
### Candidate Finding (Threaded)
- **Speed**: ~1.5 contacts/sec
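- **Time for 43K contacts**: ~16-18 hours (an effective average of ~0.7 contacts/sec; a sustained 1.5 contacts/sec would finish in about 8 hours)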
- **Workers**: 8 (default, configurable)
### LLM Verification (Async)
- **Speed**: ~32 users/minute with 100 concurrent requests
- **Cost**: ~$0.003 per user (GPT-5-mini)
- **Time for 43K users**: ~23 hours
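
The verification figures above follow from simple arithmetic, which the sketch below reproduces so the estimate can be redone for a different contact count or concurrency level. The function name is illustrative.

```python
def estimate_verification(users: int,
                          users_per_minute: float = 32,
                          cost_per_user_usd: float = 0.003) -> tuple[float, float]:
    """Return (hours, total cost in USD) for an LLM verification run."""
    hours = users / users_per_minute / 60
    cost = users * cost_per_user_usd
    return hours, cost


hours, cost = estimate_verification(43_000)
print(f"{hours:.1f} hours, ${cost:.0f}")
```

For 43,000 users this gives about 22.4 hours and about $129, consistent with the rounded figures of ~23 hours and ~$130.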
## Recent Improvements
### V6 Prompt (Latest)
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer
### URL Resolution Integration
- Added Method 5b: Bio URL resolution (listed as method 3 under Matching Methods)
- Captures 140+ additional matches
- Especially valuable when usernames differ
- 0.95 baseline confidence (very high)
## Configuration
Environment variables (in `/Users/andrewjiang/Bao/TimeToLockIn/Profile/.env`):
```
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini
```
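
A script might read these settings as sketched below. This assumes the `.env` file has already been loaded into the process environment by some other means, and falling back to `gpt-5-mini` when `OPENAI_MODEL` is unset is an assumption rather than documented behavior; `load_openai_config` is a hypothetical name.

```python
import os


def load_openai_config(env=os.environ) -> dict:
    """Read the OpenAI settings, failing early if the API key is missing."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": api_key,
        "model": env.get("OPENAI_MODEL", "gpt-5-mini"),  # assumed default
    }
```

Failing at startup is cheaper than discovering a missing key partway through a multi-hour verification run.
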
Database connections:
- `telegram_contacts` - Telegram contact data
- `twitter_data` - Twitter profile data
## Tips
1. **Always run threaded candidate finding** - 10-20x faster than single-threaded
2. **Use high concurrency for LLM verification** - 100+ concurrent requests for optimal speed
3. **Monitor costs** - Check OpenAI usage during verification
4. **Review match quality periodically** - Use `review_match_quality.py` to analyze results
5. **Test first** - Use `--test --limit 50` flags before full runs
## Troubleshooting
### LLM verification is slow
- Increase `--concurrent` parameter (try 100-200)
- Check your OpenAI rate limits (e.g., 1,000 RPM for Tier 1; limits vary by model and usage tier)
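
For intuition about what a `--concurrent` setting controls, the sketch below caps the number of in-flight requests with an `asyncio.Semaphore`. This is a common pattern and only an assumption about how `verify_twitter_matches_v2.py` works; `verify_all` and the fake verifier are hypothetical stand-ins for real LLM calls.

```python
import asyncio


async def verify_all(user_ids, verify_one, concurrent=100):
    """Run verify_one(user_id) for every user, at most `concurrent` at a time."""
    semaphore = asyncio.Semaphore(concurrent)

    async def bounded(user_id):
        async with semaphore:  # waits here once `concurrent` calls are in flight
            return await verify_one(user_id)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(u) for u in user_ids))


async def demo():
    """Stand-in for real LLM calls: records the peak number in flight."""
    in_flight = 0
    peak = 0

    async def fake_verify(user_id):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        in_flight -= 1
        return user_id

    results = await verify_all(range(20), fake_verify, concurrent=5)
    return results, peak
```

Raising the cap only helps until the API's rate limit becomes the bottleneck, which is why both settings are worth checking together.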
### Many low-quality matches
- Review and adjust V6 prompt in `verify_twitter_matches_v2.py`
- Check `review_match_quality.py` for insights
### Missing obvious matches
- Check whether a candidate was found by querying `twitter_match_candidates`
- If no candidate was found, a new matching method may be needed
- If found but not verified, check LLM reasoning in `llm_verdict`
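
The checks above can be scripted. The sketch below uses an in-memory SQLite database as a stand-in so that it runs anywhere; the real tables live in PostgreSQL, where a driver such as psycopg2 uses `%s` placeholders instead of `?`. Only the columns needed for the check are modeled, and `diagnose` is a hypothetical helper.

```python
import sqlite3


def diagnose(conn, telegram_user_id):
    """Classify a contact as 'no candidate', 'unverified', or 'verified'."""
    (candidates,) = conn.execute(
        "SELECT COUNT(*) FROM twitter_match_candidates WHERE telegram_user_id = ?",
        (telegram_user_id,),
    ).fetchone()
    if candidates == 0:
        return "no candidate"  # a new matching method may be needed
    (verified,) = conn.execute(
        "SELECT COUNT(*) FROM twitter_telegram_matches WHERE telegram_user_id = ?",
        (telegram_user_id,),
    ).fetchone()
    return "verified" if verified else "unverified"


def demo_connection():
    """In-memory stand-in holding only the columns this check needs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE twitter_match_candidates (telegram_user_id INTEGER, twitter_id TEXT);
        CREATE TABLE twitter_telegram_matches (telegram_user_id INTEGER, twitter_id TEXT);
        INSERT INTO twitter_match_candidates VALUES (1, 'a'), (2, 'b');
        INSERT INTO twitter_telegram_matches VALUES (1, 'a');
        """
    )
    return conn
```

With the demo data, user 1 is verified, user 2 has a candidate that was not verified, and user 3 has no candidate at all.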
## Future Enhancements
- Add more matching methods (location, bio keywords, etc.)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats