Mirror of https://github.com/lockin-bot/ProfileMatching.git · synced 2026-01-12 09:44:30 +08:00
Initial commit: Twitter-Telegram Profile Matching System
This module provides comprehensive Twitter-to-Telegram profile matching and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 s/contact)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V6 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
.gitignore · 47 lines · vendored · Normal file
@@ -0,0 +1,47 @@
# Environment variables
.env
.env.local
.env.*.local

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
build/
dist/
*.egg-info/

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Logs
*.log
*.out
*.pid

# Database
*.db
*.sqlite
*.sqlite3

# Checkpoints and temp files
*_checkpoint.json
*.tmp
*.temp

# OS
.DS_Store
Thumbs.db

# Test outputs
test_output/
*.test
CHANGELOG.md · 159 lines · Normal file
@@ -0,0 +1,159 @@
# ProfileMatching Changelog

## 2025-11-04 - Initial Module Creation & URL Resolution Integration

### Module Organization
- Created dedicated `ProfileMatching/` folder within UnifiedContacts
- Bundled all matching and verification scripts together
- Added comprehensive interactive `main.py` with menu system
- Added detailed `README.md` documentation

### Major Enhancement: URL Resolution Integration

**Problem**: Missing matches where Twitter usernames differ from Telegram usernames, even when the Twitter profile explicitly links to Telegram.

**Solution**: Integrated `url_resolution_queue` table data into candidate finding.

**Implementation**:
- Added new method `find_by_resolved_url()` to the TwitterMatcher class (find_twitter_candidates.py:339-378)
- Queries Twitter profiles where bio URLs (t.co/xyz) resolve to t.me/{username}
- Integrated as Method 5b in the candidate-finding pipeline (find_twitter_candidates.py:554-574)
- Baseline confidence: 0.95 (very high - explicit link in bio)

**Impact**:
- 140+ potential new matches identified
- Captures matches like Twitter @Block_Flash → Telegram @bull_flash
- Especially valuable when usernames differ but the user explicitly links profiles

**Example Match**:
```
Twitter: @Block_Flash (ID: 63429948)
Telegram: @bull_flash
Method: twitter_bio_url_resolves_to_telegram
Confidence: 0.95
Resolved URL: https://t.me/bull_flash
Original URL: https://t.co/dc3iztSG9B
```

### URL Resolution Data Source

The `url_resolution_queue` table contains:
- 16,133 Twitter profiles with resolved Telegram URLs
- Shortened URLs from Twitter bios (t.co/xyz)
- Resolved destinations (full URLs)
- Extracted Telegram handles (JSONB array)

The lookup against this table is sketched below.
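A minimal sketch of the query, mirroring `find_by_resolved_url()` in find_twitter_candidates.py (the JSONB containment check and the text cast for the VARCHAR-to-BIGINT join follow that method; the handle comes from the example above):

```sql
SELECT DISTINCT u.id, u.username, u.name, urq.original_url, urq.resolved_url
FROM url_resolution_queue urq
JOIN public.users u ON u.id::text = urq.twitter_id      -- cast BIGINT id to match VARCHAR twitter_id
WHERE urq.telegram_handles IS NOT NULL
  AND urq.telegram_handles @> '["bull_flash"]'::jsonb   -- Telegram handle to look up
ORDER BY u.followers_count DESC;
```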
### Files Modified

1. **find_twitter_candidates.py**
   - Added `find_by_resolved_url()` method
   - Integrated into `find_candidates_for_contact()` as Method 5b
   - Fixed type casting for the twitter_id (VARCHAR) to user.id (BIGINT) join

2. **main.py** (ProfileMatching module)
   - Created comprehensive interactive menu
   - Real-time statistics display
   - Streamlined workflow for candidate finding and LLM verification

3. **README.md**
   - Complete documentation of all 10 matching methods
   - Usage examples and performance metrics
   - Configuration and troubleshooting guides

4. **UnifiedContacts/main.py**
   - Added option 15: "Open Profile Matching System (Interactive Menu)"
   - Reorganized menu to separate ProfileMatching from individual Twitter steps

### Current Statistics (as of 2025-11-04)

**Candidates**:
- Users with candidates: 38,121
- Total candidates found: 253,117
- Processed by LLM: 253,085
- Pending verification: 32

**Verified Matches**:
- Users with matches: 25,662
- Total matches: 36,147
- Average confidence: 0.74
- High confidence (90%+): 12,031
- Medium confidence (80-89%): 1,452
- Low confidence (70-79%): 7,505

### LLM Verification (V6 Prompt)

Current prompt improvements:
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy (Very Strong, Strong Supporting, Weak, Red Flags)
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer
- Emphasis on evaluating ALL candidates together

### Performance Metrics

**Candidate Finding (Threaded)**:
- Speed: ~1.5 s/contact (≈0.7 contacts/sec; 43,000 × 1.5 s ≈ 18 hours)
- Time for 43K contacts: ~16-18 hours
- Workers: 8 (default)

**LLM Verification (Async)**:
- Speed: ~32 users/minute (100 concurrent requests)
- Cost: ~$0.003 per user (GPT-5-mini)
- Time for 43K users: ~23 hours (43,000 / 32 ≈ 1,344 minutes)

### Module Structure

```
ProfileMatching/
├── main.py                               # Interactive menu system
├── README.md                             # Complete documentation
├── CHANGELOG.md                          # This file
├── find_twitter_candidates.py            # Core matching logic (10 methods)
├── find_twitter_candidates_threaded.py   # Threaded implementation
├── verify_twitter_matches_v2.py          # LLM verification (V6 prompt)
├── review_match_quality.py               # Analysis tools
└── setup_twitter_matching_schema.sql     # Database schema
```

### 10 Matching Methods Summary

1. **Phash Match** (0.95/0.88) - Profile picture similarity
2. **Exact Bio Handle** (0.95) - Twitter handle extracted from Telegram bio
3. **Bio URL Resolution** (0.95) ⭐ NEW - Shortened URL resolves to Telegram
4. **Twitter Bio Has Telegram** (0.92) - Twitter bio mentions Telegram username
5. **Display Name Containment** (0.92) - TG name in TW name
6. **Exact Username** (0.90) - Usernames match exactly
7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85) - Trigram similarity
10. **Username Variation** (0.80) - Generated username variations

### Testing

All changes tested with:
- Standalone method testing (find_by_resolved_url)
- Full integration testing (find_candidates_for_contact)
- Verified deduplication works correctly
- Confirmed matches with different usernames are captured

### Next Steps

Potential future enhancements:
- Add more matching methods (location, bio keywords, mutual connections)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats
- Additional URL resolution sources beyond Twitter bios

### Migration Notes

**For existing deployments**:
1. No database schema changes required
2. Existing `url_resolution_queue` table is used as-is
3. Scripts in the `scripts/` folder remain unchanged and functional
4. The new ProfileMatching module is additive and doesn't break existing workflows

**To use new features**:
1. Use ProfileMatching/main.py instead of individual scripts
2. Or run scripts directly from the ProfileMatching folder
3. Or update import paths to use the ProfileMatching module
QUICKSTART.md · 218 lines · Normal file
@@ -0,0 +1,218 @@
# ProfileMatching Quick Start Guide

## 🚀 Fastest Way to Get Started

### Option 1: Interactive Menu (RECOMMENDED)
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```

This gives you an interactive menu with:
- Real-time statistics
- Guided workflow
- Easy access to all features

### Option 2: Launch from the Main UnifiedContacts Menu
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts
python3.10 main.py
# Select option 15: "Open Profile Matching System"
```

## 📋 Typical Workflow

### Step 1: Find Candidates (First Time)
If you haven't found candidates yet, run:

```bash
# From the ProfileMatching folder
python3.10 find_twitter_candidates_threaded.py --workers 8

# Or use interactive menu: Option 1
```

**Expected time**: ~16-18 hours for all 43K contacts

### Step 2: Verify with LLM
After candidates are found, verify them:

```bash
# From the ProfileMatching folder
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100

# Or use interactive menu: Option 3
```

**Expected time**: ~23 hours for all users
**Cost**: ~$130 for all users (GPT-5-mini at ~$0.003/user; 43,000 × $0.003 ≈ $130)

### Step 3: Review Results
```bash
# From the ProfileMatching folder
python3.10 review_match_quality.py

# Or use interactive menu: Option 5
```

## 🧪 Test Mode (Recommended Before Full Run)

Always test with a small batch first:

```bash
# Test with 50 users
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10

# Or use interactive menu: Option 4
```

This helps you:
- Verify the system is working correctly
- Check match quality before spending on a full run
- Estimate costs and timing

## 📊 Check Current Status

At any time, you can check where you're at:

```bash
# Launch the interactive menu and select Option 6: "Show statistics only"
python3.10 main.py
# Press 6, then 0 to exit
```

Or query directly:

```bash
psql -d telegram_contacts -U andrewjiang -c "
SELECT
    COUNT(DISTINCT telegram_user_id) AS users_with_candidates,
    COUNT(*) AS total_candidates,
    COUNT(*) FILTER (WHERE llm_processed = TRUE) AS processed,
    COUNT(*) FILTER (WHERE llm_processed = FALSE) AS pending
FROM twitter_match_candidates;
"
```

## 🔄 Re-Running After Updates

If you've updated the LLM prompt or the matching logic:

### Re-find Candidates (if the matching logic changed)
```bash
# Delete old candidates
psql -d telegram_contacts -U andrewjiang -c "TRUNCATE twitter_match_candidates CASCADE;"

# Re-run candidate finding
python3.10 find_twitter_candidates_threaded.py --workers 8
```

### Re-verify with New Prompt (if only the prompt changed)
```bash
# Reset the LLM processing flag
psql -d telegram_contacts -U andrewjiang -c "UPDATE twitter_match_candidates SET llm_processed = FALSE;"

# Delete old matches
psql -d telegram_contacts -U andrewjiang -c "TRUNCATE twitter_telegram_matches;"

# Re-run verification
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```

## 🎯 Most Common Commands

### Find candidates for the first 1000 contacts (testing)
```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```

### Verify matches for pending candidates
```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```

### Check match quality distribution
```bash
python3.10 review_match_quality.py
```

### Export matches to CSV (coming soon)
```bash
# Will be added in a future update
```

## 💡 Pro Tips

1. **Always use threaded candidate finding** - It's 10-20x faster
2. **Use high concurrency for verification** - 100-200 concurrent requests for optimal speed
3. **Test first** - Always run with `--test --limit 50` before full runs
4. **Monitor costs** - Check the OpenAI dashboard during verification
5. **Check the stats** - Use Option 6 in the interactive menu to monitor progress

## 🐛 Troubleshooting

### "No candidates found"
- Check that the Twitter database has data: `psql -d twitter_data -c "SELECT COUNT(*) FROM users;"`
- Check the Telegram contacts: `psql -d telegram_contacts -c "SELECT COUNT(*) FROM contacts;"`

### "LLM verification is slow"
- Increase the `--concurrent` parameter (try 150-200)
- Check OpenAI rate limits in the dashboard
- Verify your network connection

### "Too many low-quality matches"
- Review the V6 prompt in `verify_twitter_matches_v2.py`
- Run `review_match_quality.py` to analyze
- Consider adjusting confidence thresholds

### "Missing obvious matches"
- Check whether a candidate was found:
```sql
SELECT * FROM twitter_match_candidates WHERE telegram_user_id = YOUR_USER_ID;
```
- If found but not verified, check the `llm_verdict` field for the reasoning
- If not found at all, a new matching method may be needed

## 📚 More Information

- See `README.md` for complete documentation
- See `CHANGELOG.md` for recent updates
- See individual script files for command-line options

## 🆘 Need Help?

Common issues and solutions:

| Issue | Solution |
|-------|----------|
| Import errors | Make sure you're using python3.10 |
| Database connection errors | Check that PostgreSQL is running: `pg_isready` |
| OpenAI API errors | Verify the API key in the `.env` file |
| Out of memory | Reduce concurrent requests or use batching |

## 🎓 Understanding the Output

### Candidate Finding Output
```
Processing contact 1000/43000 (2.3%)
  Found 6 candidates for @username
    • exact_username: @username (0.90)
    • fuzzy_name: @similar_name (0.75)
```

### LLM Verification Output
```
[Progress] 500/1000 users (50.0%) | 125 matches | ~$1.50 | 25.0 users/min
```

### Match Quality Review
```
Total users with matches: 25,662
Total matches: 36,147
Average confidence: 0.74

Confidence Distribution:
  90%+:   12,031 matches (HIGH)
  80-89%:  1,452 matches (MEDIUM)
  70-79%:  7,505 matches (LOW)
```
README.md · 207 lines · Normal file
@@ -0,0 +1,207 @@
# Twitter-Telegram Profile Matching System

A comprehensive system for finding and verifying Twitter-Telegram profile matches using multiple matching methods and LLM-based verification.

## Overview

This system operates in two main steps:

1. **Candidate Finding**: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods
2. **LLM Verification**: Uses GPT to evaluate candidates and assign confidence scores (0.70-1.0)

## Quick Start

```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```

## Matching Methods

The system uses 10 different methods to find Twitter candidates:

### High Confidence Methods (0.90-0.95)

1. **Phash Match** (0.95 for exact, 0.88 for distance=1)
   - Compares profile picture hashes
   - Pre-computed in the `telegram_twitter_phash_matches` table

2. **Exact Bio Handle** (0.95)
   - Extracts Twitter handles from the Telegram bio
   - Patterns: `@username`, `twitter.com/username`, `x.com/username`

3. **Bio URL Resolution** (0.95) ⭐ NEW
   - Twitter bio contains a shortened URL (t.co/xyz) that resolves to `t.me/username`
   - Queries the `url_resolution_queue` table
   - Captures matches even when usernames differ

4. **Twitter Bio Has Telegram** (0.92)
   - Reverse lookup: Twitter bio mentions the Telegram username
   - Searches for `@username`, `t.me/username`, `telegram.me/username`

5. **Display Name Containment** (0.92)
   - Telegram name contained within the Twitter display name

6. **Exact Username** (0.90)
   - Telegram username exactly matches the Twitter username

### Medium Confidence Methods (0.80-0.88)

7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85)
   - PostgreSQL trigram similarity with a 0.65 threshold (see the sketch below)
10. **Username Variation** (0.80)
    - Generates variations (remove underscores, flip numbers, etc.)
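A sketch of the trigram lookup behind the fuzzy-name method, mirroring `find_by_fuzzy_name()` in find_twitter_candidates.py ('Alice Chen' is a made-up example name; the `%` operator and `similarity()` come from PostgreSQL's `pg_trgm` extension):

```sql
SELECT id, username, name, similarity(name, 'Alice Chen') AS name_score
FROM public.users
WHERE name % 'Alice Chen'                     -- trigram similarity operator
  AND similarity(name, 'Alice Chen') > 0.65   -- threshold raised from 0.5 to reduce noise
ORDER BY name_score DESC
LIMIT 3;
```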
## LLM Verification

The system uses GPT-5-mini with a sophisticated V6 prompt that:

- Evaluates ALL candidates together (comparative evaluation)
- Applies differential scoring (only one can be "most likely")
- Distinguishes between personal and company accounts
- Considers signal strength holistically
- Only saves matches with 70%+ confidence

Requests are issued asynchronously; the concurrency pattern is sketched below.
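A minimal sketch of the async fan-out, assuming a hypothetical `call_llm()` helper that wraps the OpenAI API (the real implementation lives in verify_twitter_matches_v2.py and may differ):

```python
import asyncio

async def verify_all(users, concurrent=100):
    sem = asyncio.Semaphore(concurrent)      # cap in-flight requests at --concurrent

    async def verify_one(user):
        async with sem:                      # wait for a free slot
            return await call_llm(user)      # hypothetical: one prompt per user's candidate set

    return await asyncio.gather(*(verify_one(u) for u in users))
```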
## Files

### Core Scripts

- `main.py` - Interactive menu for running the system
- `find_twitter_candidates.py` - Core matching logic (TwitterMatcher class)
- `find_twitter_candidates_threaded.py` - Threaded implementation (RECOMMENDED)
- `verify_twitter_matches_v2.py` - Async LLM verification (RECOMMENDED)
- `review_match_quality.py` - Analyze match quality and statistics

### Database Schema

- `setup_twitter_matching_schema.sql` - Database tables and indexes

## Database Tables

### `twitter_match_candidates`
Stores all potential matches found by the matching methods.

**Key fields:**
- `telegram_user_id` - Telegram contact user ID
- `twitter_id` - Twitter profile ID
- `match_method` - Which method found this candidate
- `baseline_confidence` - Initial confidence (0.0-1.0)
- `match_signals` - JSON with match details
- `llm_processed` - Whether the LLM has evaluated this candidate

### `twitter_telegram_matches`
Stores verified matches (70%+ confidence from the LLM).

**Key fields:**
- `telegram_user_id` - Telegram contact
- `twitter_id` - Matched Twitter profile
- `final_confidence` - LLM-assigned confidence (0.70-1.0)
- `llm_verdict` - LLM reasoning
- `match_method` - Original matching method
- `matched_at` - Timestamp

### `url_resolution_queue`
Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).

**Key fields:**
- `twitter_id` - Twitter profile ID
- `original_url` - Shortened URL (e.g., t.co/abc)
- `resolved_url` - Full URL (e.g., https://t.me/username)
- `telegram_handles` - Extracted Telegram handles (JSONB array)

A schema sketch for the candidates table follows.
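A hypothetical DDL sketch of `twitter_match_candidates`, reconstructed from the column list used by `save_candidates_to_db()`; the types are assumptions, and `setup_twitter_matching_schema.sql` remains the authoritative definition:

```sql
CREATE TABLE IF NOT EXISTS twitter_match_candidates (
    account_id              BIGINT,
    telegram_user_id        BIGINT,
    twitter_id              VARCHAR,   -- stored as text; cast to join against users.id
    twitter_username        TEXT,
    twitter_name            TEXT,
    twitter_bio             TEXT,
    twitter_location        TEXT,
    twitter_verified        BOOLEAN,
    twitter_blue_verified   BOOLEAN,
    twitter_followers_count INTEGER,
    candidate_rank          INTEGER,
    match_method            TEXT,
    baseline_confidence     NUMERIC(3,2),
    match_signals           JSONB,
    llm_processed           BOOLEAN DEFAULT FALSE
);
```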
## Usage Examples

### Find Candidates for All Contacts (Threaded)
```bash
python3.10 find_twitter_candidates_threaded.py --workers 8
```

### Find Candidates for First 1000 Contacts
```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```

### Verify Matches with LLM (100 concurrent requests)
```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```

### Test Mode (50 users, 10 concurrent)
```bash
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10
```

### Review Match Quality
```bash
python3.10 review_match_quality.py
```

## Performance

### Candidate Finding (Threaded)
- **Speed**: ~1.5 s/contact (≈0.7 contacts/sec)
- **Time for 43K contacts**: ~16-18 hours (43,000 × 1.5 s ≈ 18 h)
- **Workers**: 8 (default, configurable)

### LLM Verification (Async)
- **Speed**: ~32 users/minute with 100 concurrent requests
- **Cost**: ~$0.003 per user (GPT-5-mini)
- **Time for 43K users**: ~23 hours (43,000 / 32 ≈ 1,344 min)

## Recent Improvements

### V6 Prompt (Latest)
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer

### URL Resolution Integration
- Added Method 5b: Bio URL resolution
- Captures 140+ additional matches
- Especially valuable when usernames differ
- 0.95 baseline confidence (very high)

## Configuration

Environment variables (in `/Users/andrewjiang/Bao/TimeToLockIn/Profile/.env`):
```
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini
```

Database connections:
- `telegram_contacts` - Telegram contact data
- `twitter_data` - Twitter profile data

## Tips

1. **Always run threaded candidate finding** - 10-20x faster than single-threaded
2. **Use high concurrency for LLM verification** - 100+ concurrent requests for optimal speed
3. **Monitor costs** - Check OpenAI usage during verification
4. **Review match quality periodically** - Use `review_match_quality.py` to analyze results
5. **Test first** - Use the `--test --limit 50` flags before full runs

## Troubleshooting

### LLM verification is slow
- Increase the `--concurrent` parameter (try 100-200)
- Check OpenAI rate limits (1,000 RPM for Tier 1)

### Many low-quality matches
- Review and adjust the V6 prompt in `verify_twitter_matches_v2.py`
- Check `review_match_quality.py` for insights

### Missing obvious matches
- Check whether a candidate was found: query `twitter_match_candidates`
- If not found, a new matching method may be needed
- If found but not verified, check the LLM reasoning in `llm_verdict`

## Future Enhancements

- Add more matching methods (location, bio keywords, etc.)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats
find_twitter_candidates.py · 851 lines · Executable file
@@ -0,0 +1,851 @@
#!/usr/bin/env python3
"""
Twitter-Telegram Candidate Finder

Finds potential Twitter matches for Telegram contacts using:
1. Handle extraction from bios
2. Username variation generation
3. Fuzzy name matching
"""

import sys
import re
import json
from pathlib import Path
from typing import List, Dict, Set, Tuple

import psycopg2
from psycopg2.extras import DictCursor, execute_values

# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))

from db_config import SessionLocal
from models import Contact

# Twitter database connection (adjust as needed)
TWITTER_DB_CONFIG = {
    'dbname': 'twitter_data',
    'user': 'andrewjiang',  # Adjust to your setup
    'host': 'localhost',
    'port': 5432
}


class HandleExtractor:
    """Extract Twitter handles from text"""

    @staticmethod
    def extract_handles(text: str) -> List[str]:
        """Extract Twitter handles from bio/text"""
        if not text:
            return []

        handles = set()

        # Pattern 1: @username
        pattern1 = r'@([a-zA-Z0-9_]{4,15})'
        handles.update(re.findall(pattern1, text))

        # Pattern 2: twitter.com/username or x.com/username
        pattern2 = r'(?:twitter\.com|x\.com)/([a-zA-Z0-9_]{4,15})'
        handles.update(re.findall(pattern2, text, re.IGNORECASE))

        # Pattern 3: Clean standalone handles (risky, be conservative)
        # Only if text is short and looks like a handle
        if len(text) < 30 and text.count('@') == 1:
            clean = text.strip('@').strip()
            if re.match(r'^[a-zA-Z0-9_]{4,15}$', clean):
                handles.add(clean)

        return [h.lower() for h in handles if len(h) >= 4]
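# Example: HandleExtractor.extract_handles("DMs open → twitter.com/alice_0x")
# returns ['alice_0x'] (handles are lower-cased and must be 4-15 characters).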

class UsernameVariationGenerator:
    """Generate Twitter handle variations from Telegram usernames"""

    @staticmethod
    def generate_variations(telegram_username: str) -> List[str]:
        """
        Generate possible Twitter handle variations

        Examples:
        - alice_0x → [alice_0x, 0xalice, 0x_alice, alice0x]
        - trader_69 → [trader_69, 69trader, 69_trader, trader69]
        """
        if not telegram_username:
            return []

        variations = [telegram_username.lower()]

        # Remove underscores
        no_underscore = telegram_username.replace('_', '')
        if no_underscore != telegram_username and len(no_underscore) >= 4:
            variations.append(no_underscore.lower())

        # Handle "0x" patterns (common in crypto)
        if '0x' in telegram_username.lower():
            # If 0x at end, try moving to front
            if telegram_username.lower().endswith('_0x'):
                base = telegram_username[:-3]
                variations.extend([
                    f"0x{base}".lower(),
                    f"0x_{base}".lower()
                ])
            elif telegram_username.lower().endswith('0x'):
                base = telegram_username[:-2]
                variations.extend([
                    f"0x{base}".lower(),
                    f"0x_{base}".lower()
                ])

            # Try without underscore
            no_under = telegram_username.replace('_', '')
            if '0x' in no_under.lower() and no_under.lower() not in variations:
                variations.append(no_under.lower())

        # Handle trailing numbers (alice_69 → 69alice)
        match = re.match(r'^([a-z_]+?)_?(\d+)$', telegram_username, re.IGNORECASE)
        if match:
            prefix, number = match.groups()
            prefix = prefix.rstrip('_')
            if len(f"{number}{prefix}") >= 4:
                variations.extend([
                    f"{number}{prefix}".lower(),
                    f"{number}_{prefix}".lower()
                ])

        # Handle leading numbers (69_alice → alice69)
        match = re.match(r'^(\d+)_?([a-z_]+)$', telegram_username, re.IGNORECASE)
        if match:
            number, suffix = match.groups()
            suffix = suffix.lstrip('_')
            if len(f"{suffix}{number}") >= 4:
                variations.extend([
                    f"{suffix}{number}".lower(),
                    f"{suffix}_{number}".lower()
                ])

        # Single character removals (banteg → bantg, trader → trade)
        # This catches shortened versions of usernames
        base = telegram_username.lower()
        if len(base) > 4:  # Only if result would be at least 4 chars
            for i in range(len(base)):
                variation = base[:i] + base[i+1:]
                if len(variation) >= 4:
                    variations.append(variation)

        # Deduplicate and validate (4-15 chars for Twitter)
        valid = []
        for v in set(variations):
            v_clean = v.strip('_')
            if 4 <= len(v_clean) <= 15:
                valid.append(v_clean)

        return list(set(valid))
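# Example: generate_variations("alice_0x") yields variations such as
# 'alice_0x', 'alice0x', '0xalice', '0x_alice', plus single-character
# deletions like 'alce_0x' from the shortened-username pass.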

class TwitterMatcher:
    """Find Twitter profiles matching Telegram contacts"""

    def __init__(self, twitter_conn, telegram_conn=None):
        self.twitter_conn = twitter_conn
        self.telegram_conn = telegram_conn  # Needed for phash lookups
        self.handle_extractor = HandleExtractor()
        self.variation_generator = UsernameVariationGenerator()

    def find_by_handle(self, handle: str) -> Dict:
        """Look up a Twitter profile by exact handle (returns None if not found)"""
        with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
            cur.execute("""
                SELECT
                    id,
                    username,
                    name,
                    description,
                    location,
                    verified,
                    is_blue_verified,
                    followers_count,
                    following_count,
                    created_at
                FROM public.users
                WHERE LOWER(username) = %s
                LIMIT 1
            """, (handle.lower(),))

            result = cur.fetchone()
            return dict(result) if result else None

    def find_by_fuzzy_name(self, telegram_name: str, limit=3) -> List[Dict]:
        """Find Twitter profiles with similar names using fuzzy matching"""
        if not telegram_name or len(telegram_name) < 3:
            return []

        with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
            # Use parameterized query with proper escaping for the % operator
            cur.execute("""
                SELECT
                    id,
                    username,
                    name,
                    description,
                    location,
                    verified,
                    is_blue_verified,
                    followers_count,
                    following_count,
                    created_at,
                    similarity(name, %(name)s) AS name_score
                FROM public.users
                WHERE name %% %(name)s  -- %% for similarity operator (escaped in string)
                  AND similarity(name, %(name)s) > 0.65  -- Increased threshold from 0.5 to reduce noise
                ORDER BY name_score DESC
                LIMIT %(limit)s
            """, {'name': telegram_name, 'limit': limit})

            return [dict(row) for row in cur.fetchall()]

    def find_by_display_name_containment(self, telegram_name: str, limit=5) -> List[Dict]:
        """Find Twitter profiles where the TG display name is contained in the TW display name"""
        if not telegram_name or len(telegram_name) < 3:
            return []

        with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
            # Direct containment search (case-insensitive)
            cur.execute("""
                SELECT
                    id,
                    username,
                    name,
                    description,
                    location,
                    verified,
                    is_blue_verified,
                    followers_count,
                    following_count,
                    created_at
                FROM public.users
                WHERE name ILIKE %s  -- TG name contained in TW name
                  AND LENGTH(name) >= %s  -- TW name must be at least as long as TG name
                ORDER BY followers_count DESC  -- Prioritize by follower count
                LIMIT %s
            """, (f'%{telegram_name}%', len(telegram_name), limit))

            return [dict(row) for row in cur.fetchall()]
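    # A perceptual hash (phash) fingerprints an image so that visually similar
    # pictures produce similar hashes: Hamming distance 0 means the Telegram and
    # Twitter avatars hash identically, 1 means they differ by a single bit.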
    def find_by_phash_match(self, telegram_user_id: int) -> List[Dict]:
        """Find Twitter profiles with matching profile picture phash (distance 0-1 only)"""
        if not self.telegram_conn:
            return []

        with self.telegram_conn.cursor(cursor_factory=DictCursor) as cur:
            # Query pre-computed phash matches (distance 0-1 for high confidence)
            cur.execute("""
                SELECT
                    m.twitter_user_id,
                    m.twitter_username,
                    m.hamming_distance,
                    m.telegram_phash,
                    m.twitter_phash
                FROM telegram_twitter_phash_matches m
                WHERE m.telegram_user_id = %s
                  AND m.hamming_distance <= 1  -- Only exact and distance-1 matches
                ORDER BY m.hamming_distance ASC
                LIMIT 5
            """, (telegram_user_id,))

            phash_matches = cur.fetchall()

        # Commit to close transaction
        self.telegram_conn.commit()

        if not phash_matches:
            return []

        # Fetch full Twitter profile data for matched users
        twitter_ids = [m['twitter_user_id'] for m in phash_matches]

        with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
            cur.execute("""
                SELECT
                    id,
                    username,
                    name,
                    description,
                    location,
                    verified,
                    is_blue_verified,
                    followers_count,
                    following_count,
                    created_at
                FROM public.users
                WHERE id = ANY(%s)
            """, (twitter_ids,))

            twitter_profiles = [dict(row) for row in cur.fetchall()]

        # Enrich with phash match details
        twitter_profile_map = {str(p['id']): p for p in twitter_profiles}
        results = []

        for match in phash_matches:
            tw_id = match['twitter_user_id']
            if tw_id in twitter_profile_map:
                profile = twitter_profile_map[tw_id].copy()
                profile['phash_distance'] = match['hamming_distance']
                profile['telegram_phash'] = match['telegram_phash']
                profile['twitter_phash'] = match['twitter_phash']
                results.append(profile)

        return results

    def find_by_telegram_in_twitter_bio(self, telegram_username: str, limit=3) -> List[Dict]:
        """Find Twitter profiles that mention this Telegram username in their bio (exact @mention only)"""
        if not telegram_username or len(telegram_username) < 4:
            return []

        with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
            # Search for various Telegram handle patterns in Twitter bios
            # Use regex with word boundaries to avoid substring matches in other handles
            cur.execute("""
                SELECT
                    id,
                    username,
                    name,
                    description,
                    location,
                    verified,
                    is_blue_verified,
                    followers_count,
                    following_count,
                    created_at
                FROM public.users
                WHERE description IS NOT NULL
                  AND (
                      description ~* %s  -- @username with word boundary (not part of a longer handle)
                      OR LOWER(description) LIKE %s  -- t.me/username
                      OR LOWER(description) LIKE %s  -- telegram.me/username
                  )
                ORDER BY followers_count DESC
                LIMIT %s
            """, (
                r'@' + telegram_username + r'(\s|$|[^a-zA-Z0-9_])',  # Word boundary: space, end, or non-alphanumeric
                f'%t.me/{telegram_username.lower()}%',
                f'%telegram.me/{telegram_username.lower()}%',
                limit
            ))

            return [dict(row) for row in cur.fetchall()]
    def find_by_resolved_url(self, telegram_username: str) -> List[Dict]:
        """Find Twitter profiles whose bio URL resolves to this Telegram username"""
        if not telegram_username or len(telegram_username) < 4:
            return []

        with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
            # Query url_resolution_queue for Twitter profiles with resolved Telegram links
            cur.execute("""
                SELECT DISTINCT
                    u.id,
                    u.username,
                    u.name,
                    u.description,
                    u.location,
                    u.verified,
                    u.is_blue_verified,
                    u.followers_count,
                    u.following_count,
                    u.created_at,
                    urq.resolved_url,
                    urq.original_url
                FROM url_resolution_queue urq
                JOIN public.users u ON u.id::text = urq.twitter_id
                WHERE urq.telegram_handles IS NOT NULL
                  AND urq.telegram_handles @> %s::jsonb
                ORDER BY u.followers_count DESC
                LIMIT 3
            """, (f'["{telegram_username.lower()}"]',))

            results = cur.fetchall()

        # Rename resolved_url so downstream code can tell it is the Telegram link
        # (original_url is kept under its own name)
        enriched = []
        for row in results:
            profile = dict(row)
            profile['resolved_telegram_url'] = profile.pop('resolved_url')
            enriched.append(profile)

        return enriched

    def find_by_username_in_display_name(self, search_term: str, is_telegram: bool, limit: int = 5) -> List[Dict]:
        """
        Find Twitter profiles where a display name contains a username pattern.

        If is_telegram=True: search for the TG username in Twitter display names.
        If is_telegram=False: reverse direction - find Twitter profiles whose
        username appears inside the given TG display name.
        """
        with self.twitter_conn.cursor(cursor_factory=DictCursor) as cur:
            if is_telegram:
                # TG username in Twitter display name
                cur.execute("""
                    SELECT
                        id,
                        username,
                        name,
                        description,
                        location,
                        verified,
                        is_blue_verified,
                        followers_count,
                        following_count,
                        created_at
                    FROM public.users
                    WHERE name IS NOT NULL
                      AND LOWER(name) LIKE %s
                    ORDER BY followers_count DESC
                    LIMIT %s
                """, (f'%{search_term.lower()}%', limit))
            else:
                # Twitter username in TG display name - find Twitter profiles whose
                # username appears in the search term
                cur.execute("""
                    SELECT
                        id,
                        username,
                        name,
                        description,
                        location,
                        verified,
                        is_blue_verified,
                        followers_count,
                        following_count,
                        created_at
                    FROM public.users
                    WHERE username IS NOT NULL
                      AND LOWER(%s) LIKE '%%' || LOWER(username) || '%%'
                    ORDER BY followers_count DESC
                    LIMIT %s
                """, (search_term, limit))

            return [dict(row) for row in cur.fetchall()]

    def get_display_name(self, contact: Contact) -> str:
        """Get display name with fallback to first_name + last_name"""
        if contact.display_name:
            return contact.display_name
        # Fallback to first_name + last_name
        parts = [contact.first_name, contact.last_name]
        return ' '.join(p for p in parts if p).strip()

    def find_candidates_for_contact(self, contact: Contact) -> List[Dict]:
        """
        Find all Twitter candidates for a single Telegram contact.

        Returns a list of candidates with:
        {
            'twitter_profile': {...},
            'match_method': 'exact_bio_handle' | 'exact_username' | 'username_variation' | 'fuzzy_name' | ...,
            'baseline_confidence': 0.0-1.0,
            'match_details': {...}
        }
        """
        candidates = []
        seen_twitter_ids = set()

        # Get display name with fallback
        display_name = self.get_display_name(contact)

        # Method 1: Phash matching (profile picture similarity)
        phash_matches = self.find_by_phash_match(contact.user_id)
        for twitter_profile in phash_matches:
            phash_distance = twitter_profile.pop('phash_distance')
            telegram_phash = twitter_profile.pop('telegram_phash')
            twitter_phash = twitter_profile.pop('twitter_phash')

            # Baseline confidence based on phash distance
            if phash_distance == 0:
                baseline_confidence = 0.95  # Exact match - VERY strong signal
            elif phash_distance == 1:
                baseline_confidence = 0.88  # 1-bit difference - strong signal
            else:
                continue  # Skip distance > 1

            candidates.append({
                'twitter_profile': twitter_profile,
                'match_method': 'phash_match',
                'baseline_confidence': baseline_confidence,
                'match_details': {
                    'phash_distance': phash_distance,
                    'telegram_phash': telegram_phash,
                    'twitter_phash': twitter_phash
                }
            })
            seen_twitter_ids.add(twitter_profile['id'])

        # Method 2: Extract handles from bio
        if contact.bio:
            bio_handles = self.handle_extractor.extract_handles(contact.bio)
            for handle in bio_handles:
                twitter_profile = self.find_by_handle(handle)
                if twitter_profile and twitter_profile['id'] not in seen_twitter_ids:
                    candidates.append({
                        'twitter_profile': twitter_profile,
                        'match_method': 'exact_bio_handle',
                        'baseline_confidence': 0.95,
                        'match_details': {
                            'extracted_handle': handle,
                            'from_bio': True
                        }
                    })
                    seen_twitter_ids.add(twitter_profile['id'])

        # Method 3: Exact username match
        if contact.username:
            twitter_profile = self.find_by_handle(contact.username)
            if twitter_profile and twitter_profile['id'] not in seen_twitter_ids:
                candidates.append({
                    'twitter_profile': twitter_profile,
                    'match_method': 'exact_username',
                    'baseline_confidence': 0.90,
                    'match_details': {
                        'telegram_username': contact.username,
                        'twitter_username': twitter_profile['username']
                    }
                })
                seen_twitter_ids.add(twitter_profile['id'])

        # Method 4: Username variations
        if contact.username:
            variations = self.variation_generator.generate_variations(contact.username)
            for variation in variations:
                if variation == contact.username.lower():
                    continue  # Already checked in Method 3

                twitter_profile = self.find_by_handle(variation)
                if twitter_profile and twitter_profile['id'] not in seen_twitter_ids:
                    candidates.append({
                        'twitter_profile': twitter_profile,
                        'match_method': 'username_variation',
                        'baseline_confidence': 0.80,
                        'match_details': {
                            'telegram_username': contact.username,
                            'username_variation': variation,
                            'twitter_username': twitter_profile['username']
                        }
                    })
                    seen_twitter_ids.add(twitter_profile['id'])

        # Method 5: Twitter bio contains Telegram username (reverse lookup)
        if contact.username:
            reverse_matches = self.find_by_telegram_in_twitter_bio(contact.username, limit=3)
            for twitter_profile in reverse_matches:
                if twitter_profile['id'] not in seen_twitter_ids:
                    candidates.append({
                        'twitter_profile': twitter_profile,
                        'match_method': 'twitter_bio_has_telegram',
                        'baseline_confidence': 0.92,  # Very high confidence - they explicitly mention their Telegram
                        'match_details': {
                            'telegram_username': contact.username,
                            'twitter_username': twitter_profile['username'],
                            'found_in_twitter_bio': True
                        }
                    })
                    seen_twitter_ids.add(twitter_profile['id'])

        # Method 5b: Twitter bio URL resolves to Telegram username (via url_resolution_queue)
        if contact.username:
            url_resolved_matches = self.find_by_resolved_url(contact.username)
            for twitter_profile in url_resolved_matches:
                if twitter_profile['id'] not in seen_twitter_ids:
                    resolved_url = twitter_profile.pop('resolved_telegram_url', None)
                    original_url = twitter_profile.pop('original_url', None)

                    candidates.append({
                        'twitter_profile': twitter_profile,
                        'match_method': 'twitter_bio_url_resolves_to_telegram',
                        'baseline_confidence': 0.95,  # VERY high confidence - explicit URL link in bio
                        'match_details': {
                            'telegram_username': contact.username,
                            'twitter_username': twitter_profile['username'],
                            'resolved_url': resolved_url,
                            'original_url': original_url,
                            'found_via_url_resolution': True
                        }
                    })
                    seen_twitter_ids.add(twitter_profile['id'])

        # Method 6: Display name containment (TG name in TW name)
        if display_name:
            containment_matches = self.find_by_display_name_containment(display_name, limit=5)
            for twitter_profile in containment_matches:
                if twitter_profile['id'] not in seen_twitter_ids:
                    candidates.append({
                        'twitter_profile': twitter_profile,
                        'match_method': 'display_name_containment',
                        'baseline_confidence': 0.92,  # High confidence for exact name containment
                        'match_details': {
                            'telegram_name': display_name,
                            'twitter_name': twitter_profile['name'],
                            'match_type': 'tg_name_contained_in_tw_name'
                        }
                    })
                    seen_twitter_ids.add(twitter_profile['id'])

        # Method 7: Fuzzy name match (always run to find additional candidates)
        if display_name:
            fuzzy_matches = self.find_by_fuzzy_name(display_name, limit=5)
            for i, twitter_profile in enumerate(fuzzy_matches):
                name_score = twitter_profile.pop('name_score')

                # Calculate baseline confidence from name similarity
                baseline_confidence = min(0.85, name_score)  # Cap at 0.85 for fuzzy matches

                if twitter_profile['id'] not in seen_twitter_ids:
                    candidates.append({
                        'twitter_profile': twitter_profile,
                        'match_method': 'fuzzy_name',
                        'baseline_confidence': baseline_confidence,
                        'match_details': {
                            'telegram_name': display_name,
                            'twitter_name': twitter_profile['name'],
                            'fuzzy_score': name_score,
                            'candidate_rank': i + 1
                        }
                    })
                    seen_twitter_ids.add(twitter_profile['id'])

        # Method 8: TG username in Twitter display name
        if contact.username:
            matches = self.find_by_username_in_display_name(contact.username, is_telegram=True, limit=3)
            for twitter_profile in matches:
                if twitter_profile['id'] not in seen_twitter_ids:
                    candidates.append({
                        'twitter_profile': twitter_profile,
                        'match_method': 'tg_username_in_twitter_name',
                        'baseline_confidence': 0.88,
                        'match_details': {
                            'telegram_username': contact.username,
                            'twitter_name': twitter_profile['name'],
                            'found_in_display_name': True
                        }
                    })
                    seen_twitter_ids.add(twitter_profile['id'])

        # Method 9: Twitter username in TG display name
        if display_name:
            matches = self.find_by_username_in_display_name(display_name, is_telegram=False, limit=3)
            for twitter_profile in matches:
                if twitter_profile['id'] not in seen_twitter_ids:
                    candidates.append({
                        'twitter_profile': twitter_profile,
                        'match_method': 'twitter_username_in_tg_name',
                        'baseline_confidence': 0.86,
                        'match_details': {
                            'telegram_name': display_name,
                            'twitter_username': twitter_profile['username'],
                            'username_in_display_name': True
                        }
                    })
                    seen_twitter_ids.add(twitter_profile['id'])

        return candidates
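# execute_values below batches all candidate rows into a single multi-row INSERT;
# ON CONFLICT DO NOTHING makes re-runs idempotent for already-saved candidates.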

def save_candidates_to_db(telegram_user_id: int, account_id: int, candidates: List[Dict], cursor):
    """Save candidates to the twitter_match_candidates table (expects an open psycopg2 cursor)"""
    if not candidates:
        return

    insert_data = []
    for cand in candidates:
        tw = cand['twitter_profile']
        insert_data.append((
            account_id,
            telegram_user_id,
            tw['id'],
            tw['username'],
            tw['name'],
            tw.get('description', ''),
            tw.get('location'),
            tw.get('verified', False),
            tw.get('is_blue_verified', False),
            tw.get('followers_count', 0),
            cand.get('match_details', {}).get('candidate_rank', 1),
            cand['match_method'],
            cand['baseline_confidence'],
            json.dumps(cand['match_details'])  # Convert dict to JSON string
        ))

    execute_values(cursor, """
        INSERT INTO twitter_match_candidates (
            account_id,
            telegram_user_id,
            twitter_id,
            twitter_username,
            twitter_name,
            twitter_bio,
            twitter_location,
            twitter_verified,
            twitter_blue_verified,
            twitter_followers_count,
            candidate_rank,
            match_method,
            baseline_confidence,
            match_signals
        ) VALUES %s
        ON CONFLICT DO NOTHING
    """, insert_data, template="""(
        %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
    )""")


def main():
    print()
    print("=" * 70)
    print("🔍 Twitter-Telegram Candidate Finder")
    print("=" * 70)
    print()

    # Check arguments
    test_mode = '--test' in sys.argv
    limit = 100 if test_mode else None

    # Check for --limit parameter
    if '--limit' in sys.argv:
        idx = sys.argv.index('--limit')
        limit = int(sys.argv[idx + 1])
        print(f"📊 LIMIT MODE: Processing first {limit:,} contacts")
        print()
    elif test_mode:
        print("🧪 TEST MODE: Processing first 100 contacts only")
        print()

    # Connect to databases
    print("📡 Connecting to databases...")

    # Connect SQLAlchemy to localhost (not RDS)
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker
    localhost_engine = create_engine('postgresql://andrewjiang@localhost:5432/telegram_contacts')
    LocalSession = sessionmaker(bind=localhost_engine)
    telegram_db = LocalSession()

    try:
        twitter_conn = psycopg2.connect(**TWITTER_DB_CONFIG)
        twitter_conn.autocommit = False
    except Exception as e:
        print(f"❌ Failed to connect to Twitter database: {e}")
        print(f"   Config: {TWITTER_DB_CONFIG}")
        return False

    # Also need a psycopg2 connection to the Telegram DB for writing
    try:
        telegram_conn = psycopg2.connect(dbname='telegram_contacts', user='andrewjiang', host='localhost', port=5432)
        telegram_conn.autocommit = False
    except Exception as e:
        print(f"❌ Failed to connect to Telegram database: {e}")
        return False

    print("✅ Connected to both databases")
    print()

    try:
        # Initialize matcher (pass both connections for phash lookup)
        telegram_psycopg_conn = psycopg2.connect(
            dbname='telegram_contacts',
            user='andrewjiang',
            host='localhost'
        )
        matcher = TwitterMatcher(twitter_conn, telegram_psycopg_conn)

        # Get Telegram contacts with bios or usernames
        print("🔍 Loading Telegram contacts...")
        query = telegram_db.query(Contact).filter(
            Contact.user_id > 0,  # Exclude channels
            Contact.is_deleted == False,
            Contact.is_bot == False
        ).filter(
            (Contact.bio != None) | (Contact.username != None)
        ).order_by(Contact.user_id)

        if limit:
            query = query.limit(limit)

        contacts = query.all()
        print(f"✅ Found {len(contacts):,} contacts to process")
        print()

        # Process each contact
        stats = {
            'processed': 0,
            'with_candidates': 0,
            'total_candidates': 0,
            'by_method': {}
        }

        print("🚀 Finding Twitter candidates...")
        print()

        for i, contact in enumerate(contacts, 1):
            candidates = matcher.find_candidates_for_contact(contact)

            if candidates:
                stats['with_candidates'] += 1
                stats['total_candidates'] += len(candidates)

                # Track by method
                for cand in candidates:
                    method = cand['match_method']
                    stats['by_method'][method] = stats['by_method'].get(method, 0) + 1

                # Save to database
                with telegram_conn.cursor() as cur:
                    save_candidates_to_db(contact.user_id, contact.account_id, candidates, cur)
                telegram_conn.commit()

            stats['processed'] += 1

            # Progress update every 10 contacts
            if i % 10 == 0:
                print(f"   Processed {i:,}/{len(contacts):,} contacts... (candidates: {stats['with_candidates']:,}, total: {stats['total_candidates']:,})")
            elif i == 1:
                # Show first contact immediately
                print(f"   Processed 1/{len(contacts):,} contacts... (candidates: {stats['with_candidates']:,}, total: {stats['total_candidates']:,})")

        # Final stats
        print()
        print("=" * 70)
        print("✅ CANDIDATE FINDING COMPLETE")
        print("=" * 70)
        print()
        print("📊 Statistics:")
        print(f"   Processed: {stats['processed']:,} contacts")
        print(f"   With candidates: {stats['with_candidates']:,} ({stats['with_candidates']/stats['processed']*100:.1f}%)")
        print(f"   Total candidates: {stats['total_candidates']:,}")
        print(f"   Avg candidates per match: {stats['total_candidates']/max(stats['with_candidates'], 1):.1f}")
        print()
        print("📈 By method:")
        for method, count in sorted(stats['by_method'].items(), key=lambda x: -x[1]):
            print(f"   {method}: {count:,}")
        print()

        return True

    except Exception as e:
        print(f"❌ Error: {e}")
        import traceback
        traceback.print_exc()
        return False

    finally:
        telegram_db.close()
        twitter_conn.close()
        telegram_conn.close()


if __name__ == "__main__":
    try:
        success = main()
        sys.exit(0 if success else 1)
    except KeyboardInterrupt:
        print("\n\n⚠️ Interrupted by user")
        sys.exit(1)
389
find_twitter_candidates_threaded.py
Normal file
@@ -0,0 +1,389 @@
#!/usr/bin/env python3
"""
Threading-based Twitter-Telegram Candidate Finder

Uses threading instead of multiprocessing for better macOS compatibility
and efficient I/O-bound parallel processing.
"""

import sys
from pathlib import Path
import psycopg2
from psycopg2.extras import execute_values, DictCursor
from concurrent.futures import ThreadPoolExecutor, as_completed
import argparse
from typing import List, Dict, Tuple
import time
import threading

# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from models import Contact

# Import the matcher from the original script
from find_twitter_candidates import TwitterMatcher, TWITTER_DB_CONFIG

# Database configuration
TELEGRAM_DB_URL = 'postgresql://andrewjiang@localhost:5432/telegram_contacts'

# Thread-local storage for database connections
thread_local = threading.local()


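# Design note (worker-thread DB access): each worker owns its own psycopg2
# connection pair via threading.local(). A single shared connection would
# serialize every query, since psycopg2 runs one statement at a time per
# connection; per-thread connections let the I/O-bound lookups overlap.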
def get_thread_connections():
    """Get or create database connections for this thread"""
    if not hasattr(thread_local, 'twitter_conn'):
        thread_local.twitter_conn = psycopg2.connect(**TWITTER_DB_CONFIG)
        thread_local.telegram_conn = psycopg2.connect(
            dbname='telegram_contacts',
            user='andrewjiang',
            host='localhost',
            port=5432
        )
        thread_local.matcher = TwitterMatcher(
            thread_local.twitter_conn,
            thread_local.telegram_conn
        )

    return thread_local.twitter_conn, thread_local.telegram_conn, thread_local.matcher


def process_contact(contact: Contact) -> Tuple[bool, int, List[Dict], Dict]:
    """
    Process a single contact in a worker thread.

    Returns: (success, num_candidates, candidates, method_stats)
    """
    import sys
    import io

    # Capture stderr to detect silent failures
    old_stderr = sys.stderr
    sys.stderr = stderr_capture = io.StringIO()

    try:
        # Get thread-local database connections
        twitter_conn, telegram_conn, matcher = get_thread_connections()

        # Find candidates
        candidates = matcher.find_candidates_for_contact(contact)

        # Check if any warnings were logged
        stderr_content = stderr_capture.getvalue()
        if stderr_content:
            print(f"   🔍 Warnings for contact {contact.user_id}:")
            print(f"   {stderr_content}")

        method_stats = {}
        if candidates:
            # Track method stats
            for cand in candidates:
                method = cand['match_method']
                method_stats[method] = method_stats.get(method, 0) + 1

            # Add contact info to each candidate for later insertion
            for cand in candidates:
                cand['telegram_user_id'] = contact.user_id
                cand['account_id'] = contact.account_id

        return True, len(candidates), candidates, method_stats

    except Exception as e:
        print(f"   ⚠️ Error processing contact {contact.user_id}: {e}")
        print(f"   Contact details: username={contact.username}, display_name={getattr(contact, 'display_name', None)}")
        import traceback
        traceback.print_exc()
        return False, 0, [], {}
    finally:
        sys.stderr = old_stderr


def save_candidates_batch(candidates: List[Dict], telegram_conn):
    """Save a batch of candidates to database"""
    if not candidates:
        return 0

    insert_data = []
    for cand in candidates:
        tw = cand['twitter_profile']
        match_signals = cand.get('match_details', {})

        insert_data.append((
            cand['account_id'],
            cand['telegram_user_id'],
            tw['id'],
            tw['username'],
            tw.get('name'),
            tw.get('description'),
            tw.get('location'),
            tw.get('verified', False),
            tw.get('is_blue_verified', False),
            tw.get('followers_count', 0),
            0,  # candidate_rank (will be set later if needed)
            cand['match_method'],
            cand['baseline_confidence'],
            psycopg2.extras.Json(match_signals),
            True,  # needs_llm_review
            False,  # llm_processed
        ))

    with telegram_conn.cursor() as cur:
        execute_values(cur, """
            INSERT INTO twitter_match_candidates (
                account_id,
                telegram_user_id,
                twitter_id,
                twitter_username,
                twitter_name,
                twitter_bio,
                twitter_location,
                twitter_verified,
                twitter_blue_verified,
                twitter_followers_count,
                candidate_rank,
                match_method,
                baseline_confidence,
                match_signals,
                needs_llm_review,
                llm_processed
            ) VALUES %s
            ON CONFLICT (telegram_user_id, twitter_id) DO NOTHING
        """, insert_data, page_size=1000)

    telegram_conn.commit()

    return len(insert_data)


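# Shape of the candidate dicts consumed by save_candidates_batch above (as
# produced by find_candidates_for_contact plus the keys added in
# process_contact; values illustrative):
#   {'twitter_profile': {...}, 'match_method': 'exact_username',
#    'baseline_confidence': 0.9, 'match_details': {...},
#    'telegram_user_id': 12345, 'account_id': 1}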
def main():
    parser = argparse.ArgumentParser(description='Find Twitter candidates for Telegram contacts (threading-based)')
    parser.add_argument('--limit', type=int, help='Limit number of Telegram contacts to process')
    parser.add_argument('--test', action='store_true', help='Test mode: process first 100 contacts only')
    parser.add_argument('--workers', type=int, default=8,
                        help='Number of worker threads (default: 8)')
    parser.add_argument('--user-id-min', type=int, help='Minimum user_id to process (for parallel ranges)')
    parser.add_argument('--user-id-max', type=int, help='Maximum user_id to process (for parallel ranges)')
    parser.add_argument('--range-name', type=str, help='Name for this range (for logging)')

    args = parser.parse_args()

    num_workers = args.workers
    limit = args.limit
    user_id_min = args.user_id_min
    user_id_max = args.user_id_max
    range_name = args.range_name or "full"

    print("=" * 70)
    print(f"🔍 Twitter-Telegram Candidate Finder (THREADED) - Range: {range_name}")
    print("=" * 70)
    print()

    if args.test:
        limit = 100
        print("🧪 TEST MODE: Processing first 100 contacts only")
        print()
    elif limit:
        print(f"📊 LIMIT MODE: Processing first {limit:,} contacts")
        print()

    if user_id_min is not None and user_id_max is not None:
        print(f"📍 User ID Range: {user_id_min:,} to {user_id_max:,}")
        print()

    print(f"🧵 Worker threads: {num_workers}")
    print()

    # Load contacts using raw psycopg2
    print("📡 Loading Telegram contacts...")

    conn = psycopg2.connect(
        dbname='telegram_contacts',
        user='andrewjiang',
        host='localhost',
        port=5432
    )

    with conn.cursor() as cur:
        # First, get list of already processed contacts
        cur.execute("""
            SELECT DISTINCT telegram_user_id
            FROM twitter_match_candidates
        """)
        already_processed = set(row[0] for row in cur.fetchall())

        print(f"📋 Already processed: {len(already_processed):,} contacts (will skip)")
        print()

        query = """
            SELECT account_id, user_id, display_name, first_name, last_name, username, phone, bio, is_bot, is_deleted
            FROM contacts
            WHERE user_id > 0
              AND is_deleted = false
              AND is_bot = false
              AND (bio IS NOT NULL OR username IS NOT NULL)
        """

        # Add user_id range filter if specified
        if user_id_min is not None:
            query += f" AND user_id >= {user_id_min}"
        if user_id_max is not None:
            query += f" AND user_id <= {user_id_max}"

        query += " ORDER BY user_id"

        if limit:
            query += f" LIMIT {limit}"

        cur.execute(query)
        rows = cur.fetchall()

    conn.close()

    # Convert to Contact objects, skipping already processed
    contacts = []
    skipped = 0
    for row in rows:
        user_id = row[1]

        # Skip if already processed
        if user_id in already_processed:
            skipped += 1
            continue

        contact = Contact(
            account_id=row[0],
            user_id=user_id,
            display_name=row[2],
            first_name=row[3],
            last_name=row[4],
            username=row[5],
            phone=row[6],
            bio=row[7],
            is_bot=row[8],
            is_deleted=row[9]
        )
        contacts.append(contact)

    total_contacts = len(contacts)
    print(f"✅ Found {total_contacts:,} NEW contacts to process (skipped {skipped:,} already done)")
    print()

    print("🚀 Processing contacts with thread pool...")
    print()

    start_time = time.time()

    # Stats tracking
    total_processed = 0
    total_with_candidates = 0
    all_candidates = []
    combined_method_stats = {}
    total_saved = 0

    # Database connection for incremental saves
    telegram_conn = psycopg2.connect(
        dbname='telegram_contacts',
        user='andrewjiang',
        host='localhost',
        port=5432
    )

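    # Resumability note: the incremental batch commits below pair with the
    # already_processed skip-set loaded above, so an interrupted run can be
    # restarted and will only redo contacts whose candidates were never saved.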
    # Process with ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Submit all tasks
        future_to_contact = {
            executor.submit(process_contact, contact): contact
            for contact in contacts
        }

        # Process results as they complete
        for i, future in enumerate(as_completed(future_to_contact), 1):
            try:
                success, num_candidates, candidates, method_stats = future.result()

                if success:
                    total_processed += 1

                    if candidates:
                        total_with_candidates += 1
                        all_candidates.extend(candidates)

                        # Update method stats
                        for method, count in method_stats.items():
                            combined_method_stats[method] = combined_method_stats.get(method, 0) + count

                    # Save to database once the buffer holds 100+ candidates
                    if len(all_candidates) >= 100:
                        saved = save_candidates_batch(all_candidates, telegram_conn)
                        total_saved += saved
                        all_candidates = []  # Clear buffer

                # Progress update every 10 contacts
                if i % 10 == 0 or i == 1:
                    elapsed = time.time() - start_time
                    rate = total_processed / elapsed if elapsed > 0 else 0
                    remaining = total_contacts - total_processed
                    eta_seconds = remaining / rate if rate > 0 else 0
                    eta_hours = eta_seconds / 3600

                    print(f"   Progress: {total_processed}/{total_contacts} ({total_processed/total_contacts*100:.1f}%) | "
                          f"{total_saved + len(all_candidates):,} candidates (💾 {total_saved:,} saved) | "
                          f"Rate: {rate:.1f}/sec | "
                          f"ETA: {eta_hours:.1f}h")

            except Exception as e:
                print(f"   ❌ Error processing future: {e}")
                import traceback
                traceback.print_exc()

    # Save any remaining candidates
    if all_candidates:
        saved = save_candidates_batch(all_candidates, telegram_conn)
        total_saved += saved

    telegram_conn.close()

    elapsed = time.time() - start_time

    print()
    print(f"⏱️ Processing completed in {elapsed:.1f} seconds ({elapsed/60:.1f} minutes)")
    print(f"   Rate: {total_processed/elapsed:.1f} contacts/second")
    print()
    print(f"💾 Total saved: {total_saved:,} candidates")
    print()

    # Print stats
    print("=" * 70)
    print("✅ CANDIDATE FINDING COMPLETE")
    print("=" * 70)
    print()
    print(f"📊 Statistics:")
    print(f"   Processed: {total_processed:,} contacts")
    print(f"   With candidates: {total_with_candidates:,} ({total_with_candidates/total_processed*100:.1f}%)")
    print(f"   Total candidates: {total_saved:,}")
    if total_with_candidates > 0:
        print(f"   Avg candidates per match: {total_saved/total_with_candidates:.1f}")
    print()
    print(f"📈 By method:")
    for method, count in sorted(combined_method_stats.items(), key=lambda x: x[1], reverse=True):
        print(f"   {method}: {count:,}")
    print()

    return True


if __name__ == "__main__":
    try:
        success = main()
        sys.exit(0 if success else 1)
    except KeyboardInterrupt:
        print("\n\n❌ Interrupted by user")
        sys.exit(1)
    except Exception as e:
        print(f"\n❌ Error: {e}")
        import traceback
        traceback.print_exc()
        sys.exit(1)
225
main.py
Executable file
@@ -0,0 +1,225 @@
#!/usr/bin/env python3
"""
Twitter-Telegram Profile Matching System
Main menu for finding candidates and verifying matches with LLM
"""

import sys
import os
import subprocess
import psycopg2

# Add parent directory to path for imports
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'src'))

# Database configuration
DB_CONFIG = {
    'dbname': 'telegram_contacts',
    'user': 'andrewjiang',
    'host': 'localhost',
    'port': 5432
}

TWITTER_DB_CONFIG = {
    'dbname': 'twitter_data',
    'user': 'andrewjiang',
    'host': 'localhost',
    'port': 5432
}


def get_stats():
    """Get current matching statistics"""
    conn = psycopg2.connect(**DB_CONFIG)
    cur = conn.cursor()

    stats = {}

    # Candidates stats
    cur.execute("""
        SELECT
            COUNT(DISTINCT telegram_user_id) as total_users,
            COUNT(*) as total_candidates,
            COUNT(*) FILTER (WHERE llm_processed = TRUE) as processed_candidates,
            COUNT(*) FILTER (WHERE llm_processed = FALSE) as pending_candidates
        FROM twitter_match_candidates
    """)
    row = cur.fetchone()
    stats['total_users'] = row[0]
    stats['total_candidates'] = row[1]
    stats['processed_candidates'] = row[2]
    stats['pending_candidates'] = row[3]

    # Matches stats
    cur.execute("""
        SELECT
            COUNT(*) as total_matches,
            AVG(final_confidence) as avg_confidence,
            COUNT(*) FILTER (WHERE final_confidence >= 0.90) as high_conf,
            COUNT(*) FILTER (WHERE final_confidence >= 0.80 AND final_confidence < 0.90) as med_conf,
            COUNT(*) FILTER (WHERE final_confidence >= 0.70 AND final_confidence < 0.80) as low_conf
        FROM twitter_telegram_matches
    """)
    row = cur.fetchone()
    stats['total_matches'] = row[0]
    stats['avg_confidence'] = row[1] or 0
    stats['high_conf'] = row[2]
    stats['med_conf'] = row[3]
    stats['low_conf'] = row[4]

    # Users with matches
    cur.execute("""
        SELECT COUNT(DISTINCT telegram_user_id)
        FROM twitter_telegram_matches
    """)
    stats['users_with_matches'] = cur.fetchone()[0]

    cur.close()
    conn.close()

    return stats


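# Illustrative shape of the dict returned by get_stats() (values invented
# for illustration only):
#   {'total_users': 43000, 'total_candidates': 120000,
#    'processed_candidates': 80000, 'pending_candidates': 40000,
#    'total_matches': 25000, 'avg_confidence': 0.86,
#    'high_conf': 12000, 'med_conf': 8000, 'low_conf': 5000,
#    'users_with_matches': 24000}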
def print_header():
    """Print main header"""
    print()
    print("=" * 80)
    print("🔗 Twitter-Telegram Profile Matching System")
    print("=" * 80)
    print()


def print_stats():
    """Print current statistics"""
    stats = get_stats()

    print("📊 Current Statistics:")
    print("-" * 80)
    print(f"Candidates:")
    print(f"  • Users with candidates: {stats['total_users']:,}")
    print(f"  • Total candidates found: {stats['total_candidates']:,}")
    print(f"  • Processed by LLM: {stats['processed_candidates']:,}")
    print(f"  • Pending verification: {stats['pending_candidates']:,}")
    print()
    print(f"Verified Matches:")
    print(f"  • Users with matches: {stats['users_with_matches']:,}")
    print(f"  • Total matches: {stats['total_matches']:,}")
    print(f"  • Average confidence: {stats['avg_confidence']:.2f}")
    print(f"  • High confidence (90%+): {stats['high_conf']:,}")
    print(f"  • Medium confidence (80-89%): {stats['med_conf']:,}")
    print(f"  • Low confidence (70-79%): {stats['low_conf']:,}")
    print("-" * 80)
    print()


def run_script(script_name, *args):
    """Run a Python script with arguments"""
    script_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), f"{script_name}.py")
    cmd = ['python3.10', script_path] + list(args)
    subprocess.run(cmd)


def main():
    while True:
        print_header()
        print_stats()

        print("📋 Main Menu:")
        print()
        print("STEP 1: Find Candidates")
        print("  1. Find Twitter candidates (threaded, RECOMMENDED)")
        print("  2. Find Twitter candidates (single-threaded)")
        print()
        print("STEP 2: Verify with LLM")
        print("  3. Verify matches with LLM (async, RECOMMENDED)")
        print("  4. Verify matches with LLM (test mode - 50 users)")
        print()
        print("Analysis & Review")
        print("  5. Review match quality")
        print("  6. Show statistics only")
        print()
        print("  0. Exit")
        print()

        choice = input("👉 Enter your choice: ").strip()

        if choice == '0':
            print("\n👋 Goodbye!\n")
            break

        elif choice == '1':
            # Find candidates (threaded)
            print()
            print("🔍 Finding Twitter candidates (threaded mode)...")
            print()
            limit_input = input("👉 How many contacts? (press Enter for all): ").strip()
            workers = input("👉 Number of worker threads (default: 8): ").strip() or '8'

            if limit_input:
                run_script('find_twitter_candidates_threaded', '--limit', limit_input, '--workers', workers)
            else:
                run_script('find_twitter_candidates_threaded', '--workers', workers)

            input("\n✅ Press Enter to continue...")

        elif choice == '2':
            # Find candidates (single-threaded)
            print()
            print("🔍 Finding Twitter candidates (single-threaded mode)...")
            print()
            limit_input = input("👉 How many contacts? (press Enter for all): ").strip()

            if limit_input:
                run_script('find_twitter_candidates', '--limit', limit_input)
            else:
                run_script('find_twitter_candidates')

            input("\n✅ Press Enter to continue...")

        elif choice == '3':
            # Verify with LLM (async)
            print()
            print("🤖 Verifying matches with LLM (async mode)...")
            print()
            concurrent = input("👉 Concurrent requests (default: 100): ").strip() or '100'

            run_script('verify_twitter_matches_v2', '--verbose', '--concurrent', concurrent)

            input("\n✅ Press Enter to continue...")

        elif choice == '4':
            # Verify with LLM (test mode)
            print()
            print("🧪 Test mode: Verifying 50 users with LLM...")
            print()

            run_script('verify_twitter_matches_v2', '--test', '--limit', '50', '--verbose', '--concurrent', '10')

            input("\n✅ Press Enter to continue...")

        elif choice == '5':
            # Review match quality
            print()
            print("📊 Reviewing match quality...")
            print()

            run_script('review_match_quality')

            input("\n✅ Press Enter to continue...")

        elif choice == '6':
            # Just show stats, loop back to menu
            continue

        else:
            print("\n❌ Invalid choice. Please try again.\n")
            input("Press Enter to continue...")


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\n\n👋 Interrupted. Goodbye!\n")
        sys.exit(0)
406
review_match_quality.py
Executable file
@@ -0,0 +1,406 @@
#!/usr/bin/env python3
"""
Critical Match Quality Reviewer

Analyzes verification results with deep understanding of:
- Twitter/Telegram/crypto culture
- Common false positive patterns
- Company vs personal account indicators
- Context alignment
"""

import re
import json
from pathlib import Path
from typing import List, Dict, Tuple

LOG_FILE = Path(__file__).parent.parent / 'verification_v2_log.txt'


class MatchReviewer:
    """Critical reviewer with crypto/web3 domain knowledge"""

    # Company/product/team indicators
    COMPANY_INDICATORS = [
        r'\bis\s+(a|an)\s+\w+\s+(way|tool|platform|app|service|protocol)',
        r'official\s+(account|page|channel)',
        r'(team|official)\s*$',
        r'^(the|a)\s+\w+\s+(for|to)',
        r'brought to you by',
        r'hosted by',
        r'founded by',
        r'(community|project)\s+(account|page)',
        r'(dao|protocol|network)\s*$',
        r'building\s+(the|a)\s+',
    ]

    # Personal account indicators
    PERSONAL_INDICATORS = [
        r'(ceo|founder|co-founder|developer|builder|engineer|researcher)\s+at',
        r'working\s+(at|on|with)',
        r'^(i|i\'m|my)',
        r'(my|personal)\s+(views|opinions|thoughts)',
        r'\b(he/him|she/her|they/them)\b',
    ]

    # Crypto/Web3 keywords
    CRYPTO_KEYWORDS = [
        'web3', 'crypto', 'blockchain', 'defi', 'nft', 'dao', 'dapp',
        'ethereum', 'solana', 'bitcoin', 'polygon', 'base', 'arbitrum',
        'smart contract', 'token', 'wallet', 'metamask', 'coinbase',
        'yield', 'farming', 'staking', 'airdrop', 'whitelist', 'mint',
        'protocol', 'l1', 'l2', 'rollup', 'zk', 'evm'
    ]

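    # Illustrative hits for the patterns above: a bio like "X is a simple
    # way to trade onchain" trips the first COMPANY_INDICATORS pattern, and
    # "Official account of ..." trips the second; a bio starting
    # "Co-founder at Example" would instead register as a PERSONAL_INDICATORS hit.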
    def __init__(self):
        self.issues = []
        self.stats = {
            'total_matches': 0,
            'false_positives': 0,
            'questionable': 0,
            'good_matches': 0,
            'company_accounts': 0,
            'weak_evidence': 0,
            'context_mismatch': 0,
        }

    def parse_log(self) -> List[Dict]:
        """Parse the verification log into structured data"""
        with open(LOG_FILE, 'r') as f:
            content = f.read()

        entries = []
        # Find all sections that start with TELEGRAM USER
        pattern = r'TELEGRAM USER: ([^\(]+) \(ID: (\d+)\)(.*?)(?=TELEGRAM USER:|$)'
        matches = re.finditer(pattern, content, re.DOTALL)

        for match in matches:
            tg_username = match.group(1).strip()
            tg_id = int(match.group(2))
            section = match.group(3)

            # Extract TG profile
            tg_bio_match = re.search(r'Bio: (.*?)(?:\n+TWITTER CANDIDATES)', section, re.DOTALL)
            tg_bio = tg_bio_match.group(1).strip() if tg_bio_match else ''

            # Extract username specificity
            spec_match = re.search(r'username has specificity score ([\d.]+)', section)
            specificity = float(spec_match.group(1)) if spec_match else 0.5

            # Extract candidates
            candidates = []
            candidate_blocks = re.findall(
                r'\[Candidate (\d+)\](.*?)(?=\[Candidate \d+\]|LLM RESPONSE:)',
                section,
                re.DOTALL
            )

            for idx, block in candidate_blocks:
                tw_username = re.search(r'Twitter Username: @(\S+)', block)
                tw_name = re.search(r'Twitter Display Name: (.+)', block)
                tw_bio = re.search(r'Twitter Bio: (.*?)(?=\nLocation:)', block, re.DOTALL)
                tw_followers = re.search(r'Followers: ([\d,]+)', block)
                match_method = re.search(r'Match Method: (\S+)', block)
                baseline_conf = re.search(r'Baseline Confidence: ([\d.]+)', block)

                candidates.append({
                    'index': int(idx),
                    'twitter_username': tw_username.group(1) if tw_username else '',
                    'twitter_name': tw_name.group(1).strip() if tw_name else '',
                    'twitter_bio': tw_bio.group(1).strip() if tw_bio else '',
                    'twitter_followers': tw_followers.group(1) if tw_followers else '0',
                    'match_method': match_method.group(1) if match_method else '',
                    'baseline_confidence': float(baseline_conf.group(1)) if baseline_conf else 0.0,
                })

            # Extract LLM response (handle multiline JSON with nested structures)
            llm_match = re.search(r'LLM RESPONSE:\s*-+\s*(\{.*)', section, re.DOTALL)
            if llm_match:
                try:
                    # Extract JSON - it should be everything after "LLM RESPONSE:" until end of section
                    json_text = llm_match.group(1)
                    # Find the JSON object (balanced braces)
                    brace_count = 0
                    json_end = 0
                    for i, char in enumerate(json_text):
                        if char == '{':
                            brace_count += 1
                        elif char == '}':
                            brace_count -= 1
                            if brace_count == 0:
                                json_end = i + 1
                                break

                    if json_end > 0:
                        json_str = json_text[:json_end]
                        llm_result = json.loads(json_str)
                    else:
                        llm_result = {'candidates': []}
                except Exception:
                    llm_result = {'candidates': []}
            else:
                llm_result = {'candidates': []}

            entries.append({
                'telegram_username': tg_username,
                'telegram_id': tg_id,
                'telegram_bio': tg_bio,
                'username_specificity': specificity,
                'candidates': candidates,
                'llm_results': llm_result.get('candidates', [])
            })

        return entries

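    # For reference, parse_log above expects log sections shaped like the
    # output written by verify_twitter_matches_v2 (sketch):
    #   TELEGRAM USER: someuser (ID: 12345)
    #   ...
    #   Bio: ...
    #   TWITTER CANDIDATES ...
    #   [Candidate 1]
    #   Twitter Username: @...
    #   ...
    #   LLM RESPONSE:
    #   ----------------------------------------------------------------
    #   {"candidates": [{"candidate_index": 1, "confidence": 0.85, ...}]}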
    def is_company_account(self, bio: str, name: str) -> Tuple[bool, str]:
        """Detect if this is a company/product/team account"""
        text = (bio + ' ' + name).lower()

        for pattern in self.COMPANY_INDICATORS:
            if re.search(pattern, text, re.IGNORECASE):
                return True, f"Company pattern: '{pattern}'"

        # Check if name equals bio description
        if bio and len(bio.split()) < 20:
            # Short bio describing what something "is"
            if re.search(r'\bis\s+(a|an|the)\s+', bio, re.IGNORECASE):
                return True, "Bio describes a product/service"

        return False, ""

    def is_personal_account(self, bio: str) -> bool:
        """Detect personal account indicators"""
        for pattern in self.PERSONAL_INDICATORS:
            if re.search(pattern, bio, re.IGNORECASE):
                return True
        return False

    def has_crypto_context(self, bio: str) -> Tuple[bool, List[str]]:
        """Check if bio has crypto/web3 context"""
        if not bio:
            return False, []

        bio_lower = bio.lower()
        found_keywords = []

        for keyword in self.CRYPTO_KEYWORDS:
            if keyword in bio_lower:
                found_keywords.append(keyword)

        return len(found_keywords) > 0, found_keywords

    def review_match(self, entry: Dict) -> Dict:
        """Critically review a single match"""
        issues = []
        severity = 'GOOD'

        tg_username = entry['telegram_username']
        tg_bio = entry['telegram_bio']
        tg_has_crypto, tg_crypto_keywords = self.has_crypto_context(tg_bio)

        # Review each LLM-approved match (confidence >= 0.5)
        for llm_result in entry['llm_results']:
            confidence = llm_result.get('confidence', 0)
            if confidence < 0.5:
                continue

            self.stats['total_matches'] += 1
            candidate_idx = llm_result.get('candidate_index', 0) - 1

            if candidate_idx < 0 or candidate_idx >= len(entry['candidates']):
                continue

            candidate = entry['candidates'][candidate_idx]
            tw_username = candidate['twitter_username']
            tw_bio = candidate['twitter_bio']
            tw_name = candidate['twitter_name']
            match_method = candidate['match_method']

            # Check 1: Company account
            is_company, company_reason = self.is_company_account(tw_bio, tw_name)
            if is_company:
                issues.append({
                    'type': 'COMPANY_ACCOUNT',
                    'severity': 'HIGH',
                    'description': f"Twitter @{tw_username} appears to be a company/product account",
                    'evidence': company_reason,
                    'confidence': confidence
                })
                self.stats['company_accounts'] += 1
                severity = 'FALSE_POSITIVE'

            # Check 2: Context mismatch
            tw_has_crypto, tw_crypto_keywords = self.has_crypto_context(tw_bio)

            if tg_has_crypto and not tw_has_crypto:
                issues.append({
                    'type': 'CONTEXT_MISMATCH',
                    'severity': 'MEDIUM',
                    'description': f"TG has crypto context but TW doesn't",
                    'evidence': f"TG keywords: {tg_crypto_keywords}, TW keywords: none",
                    'confidence': confidence
                })
                self.stats['context_mismatch'] += 1
                if severity == 'GOOD':
                    severity = 'QUESTIONABLE'

            # Check 3: Empty bio with no strong evidence
            if not tg_bio and not tw_bio and confidence > 0.8:
                issues.append({
                    'type': 'WEAK_EVIDENCE',
                    'severity': 'MEDIUM',
                    'description': f"High confidence ({confidence}) with both bios empty",
                    'evidence': f"Only username match, no contextual verification",
                    'confidence': confidence
                })
                self.stats['weak_evidence'] += 1
                if severity == 'GOOD':
                    severity = 'QUESTIONABLE'

            # Check 4: Generic username with high confidence
            if entry['username_specificity'] < 0.6 and confidence > 0.85:
                issues.append({
                    'type': 'GENERIC_USERNAME',
                    'severity': 'LOW',
                    'description': f"Generic username ({entry['username_specificity']:.2f} specificity) with high confidence",
                    'evidence': f"Username: {tg_username}",
                    'confidence': confidence
                })
                if severity == 'GOOD':
                    severity = 'QUESTIONABLE'

            # Check 5: Twitter bio mentions other accounts
            if match_method == 'twitter_bio_has_telegram':
                # Check if the telegram username appears as @mention (not the account itself)
                mentions = re.findall(r'@(\w+)', tw_bio)
                if tg_username.lower() not in [m.lower() for m in mentions]:
                    # The username is embedded in another handle; the severity
                    # tally below records this as a false positive, so it is
                    # not incremented again here
                    issues.append({
                        'type': 'SUBSTRING_MATCH',
                        'severity': 'HIGH',
                        'description': f"TG username found as substring in other accounts, not direct mention",
                        'evidence': f"TW bio: {tw_bio[:100]}",
                        'confidence': confidence
                    })
                    severity = 'FALSE_POSITIVE'

        # Count severity
        if severity == 'FALSE_POSITIVE':
            self.stats['false_positives'] += 1
        elif severity == 'QUESTIONABLE':
            self.stats['questionable'] += 1
        else:
            self.stats['good_matches'] += 1

        return {
            'telegram_username': tg_username,
            'telegram_id': entry['telegram_id'],
            'severity': severity,
            'issues': issues,
            'entry': entry
        }

    def generate_report(self, reviews: List[Dict]):
        """Generate comprehensive review report"""
        print()
        print("=" * 100)
        print("🔍 MATCH QUALITY REVIEW REPORT")
        print("=" * 100)
        print()

        print("📊 STATISTICS:")
        print(f"   Total matches reviewed: {self.stats['total_matches']}")
        print(f"   ✅ Good matches: {self.stats['good_matches']} ({self.stats['good_matches']/max(self.stats['total_matches'],1)*100:.1f}%)")
        print(f"   ⚠️ Questionable: {self.stats['questionable']} ({self.stats['questionable']/max(self.stats['total_matches'],1)*100:.1f}%)")
        print(f"   ❌ False positives: {self.stats['false_positives']} ({self.stats['false_positives']/max(self.stats['total_matches'],1)*100:.1f}%)")
        print()

        print("🚨 ISSUE BREAKDOWN:")
        print(f"   Company accounts: {self.stats['company_accounts']}")
        print(f"   Context mismatches: {self.stats['context_mismatch']}")
        print(f"   Weak evidence: {self.stats['weak_evidence']}")
        print()

        # Show false positives
        false_positives = [r for r in reviews if r['severity'] == 'FALSE_POSITIVE']
        if false_positives:
            print("=" * 100)
            print("❌ FALSE POSITIVES:")
            print("=" * 100)
            for review in false_positives[:10]:  # Show top 10
                print()
                print(f"TG @{review['telegram_username']} (ID: {review['telegram_id']})")
                print(f"TG Bio: {review['entry']['telegram_bio'][:100]}")
                for issue in review['issues']:
                    print(f"   ❌ [{issue['severity']}] {issue['type']}: {issue['description']}")
                    print(f"      Evidence: {issue['evidence'][:150]}")
                    print(f"      LLM Confidence: {issue['confidence']:.2f}")

        # Show questionable matches
        questionable = [r for r in reviews if r['severity'] == 'QUESTIONABLE']
        if questionable:
            print()
            print("=" * 100)
            print("⚠️ QUESTIONABLE MATCHES:")
            print("=" * 100)
            for review in questionable[:10]:  # Show top 10
                print()
                print(f"TG @{review['telegram_username']} (ID: {review['telegram_id']})")
                for issue in review['issues']:
                    print(f"   ⚠️ [{issue['severity']}] {issue['type']}: {issue['description']}")
                    print(f"      Evidence: {issue['evidence'][:150]}")
                    print(f"      LLM Confidence: {issue['confidence']:.2f}")

        print()
        print("=" * 100)
        print("💡 RECOMMENDATIONS:")
        print("=" * 100)
        print()

        if self.stats['company_accounts'] > 0:
            print("1. Add company account detection to prompt:")
            print("   - Check for product descriptions ('X is a platform for...')")
            print("   - Look for 'official', 'team', 'hosted by' patterns")
            print("   - Distinguish personal vs organizational accounts")
            print()

        if self.stats['context_mismatch'] > 0:
            print("2. Strengthen context matching:")
            print("   - Require crypto/web3 keywords in both profiles")
            print("   - Lower confidence when contexts don't align")
            print()

        if self.stats['weak_evidence'] > 0:
            print("3. Adjust confidence for weak evidence:")
            print("   - Cap confidence at 0.70 when both bios are empty")
            print("   - Require additional signals beyond username match")
            print()

        print("4. Fix 'twitter_bio_has_telegram' method:")
        print("   - Only match direct @mentions, not substrings in other handles")
        print("   - Example: @hipster should NOT match mentions of @HipsterHacker")
        print()


def main():
    reviewer = MatchReviewer()

    print("📖 Parsing verification log...")
    entries = reviewer.parse_log()
    print(f"✅ Parsed {len(entries)} verification entries")
    print()

    print("🔍 Reviewing match quality...")
    reviews = []
    for entry in entries:
        if entry['llm_results']:  # Only review entries with matches
            review = reviewer.review_match(entry)
            reviews.append(review)
    print(f"✅ Reviewed {len(reviews)} matches")

    reviewer.generate_report(reviews)


if __name__ == "__main__":
    main()
101
setup_twitter_matching_schema.sql
Normal file
@@ -0,0 +1,101 @@
-- Twitter-Telegram Matching Schema
-- Run this against your telegram_contacts database

-- Table: twitter_telegram_matches
-- Stores confirmed matches between Telegram and Twitter profiles
CREATE TABLE IF NOT EXISTS twitter_telegram_matches (
    id SERIAL PRIMARY KEY,

    -- Telegram side
    telegram_user_id BIGINT NOT NULL REFERENCES contacts(user_id),
    telegram_username VARCHAR(255),
    telegram_name VARCHAR(255),
    telegram_bio TEXT,

    -- Twitter side
    twitter_id VARCHAR(50) NOT NULL,
    twitter_username VARCHAR(100) NOT NULL,
    twitter_name VARCHAR(200),
    twitter_bio TEXT,
    twitter_location VARCHAR(200),
    twitter_verified BOOLEAN,
    twitter_blue_verified BOOLEAN,
    twitter_followers_count INTEGER,

    -- Matching metadata
    match_method VARCHAR(50) NOT NULL,  -- 'exact_bio_handle', 'exact_username', 'username_variation', 'fuzzy_name'
    baseline_confidence FLOAT NOT NULL, -- Confidence before LLM (0-1)
    llm_verdict VARCHAR(20) NOT NULL,   -- 'YES', 'NO', 'UNSURE'
    final_confidence FLOAT NOT NULL CHECK (final_confidence BETWEEN 0 AND 1),

    -- Match details (JSON for flexibility)
    match_details JSONB,  -- {extracted_handles: [...], username_variation: 'xxx', fuzzy_score: 0.85}

    -- LLM metadata
    llm_tokens_used INTEGER,
    llm_cost FLOAT,

    -- Audit trail
    matched_at TIMESTAMP DEFAULT NOW(),
    needs_manual_review BOOLEAN DEFAULT FALSE,
    verified_manually BOOLEAN DEFAULT FALSE,
    manual_review_notes TEXT,

    UNIQUE(telegram_user_id, twitter_id)
);

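-- Example query against the matches table (illustrative, not part of the
-- schema; mirrors the confidence bands reported by main.py's stats menu):
-- SELECT telegram_username, twitter_username, final_confidence
-- FROM twitter_telegram_matches
-- WHERE final_confidence >= 0.90
-- ORDER BY final_confidence DESC;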
-- Indexes for performance
CREATE INDEX IF NOT EXISTS idx_ttm_telegram_user ON twitter_telegram_matches(telegram_user_id);
CREATE INDEX IF NOT EXISTS idx_ttm_twitter_id ON twitter_telegram_matches(twitter_id);
CREATE INDEX IF NOT EXISTS idx_ttm_twitter_username ON twitter_telegram_matches(LOWER(twitter_username));
CREATE INDEX IF NOT EXISTS idx_ttm_confidence ON twitter_telegram_matches(final_confidence DESC);
CREATE INDEX IF NOT EXISTS idx_ttm_verdict ON twitter_telegram_matches(llm_verdict);
CREATE INDEX IF NOT EXISTS idx_ttm_method ON twitter_telegram_matches(match_method);
CREATE INDEX IF NOT EXISTS idx_ttm_needs_review ON twitter_telegram_matches(needs_manual_review) WHERE needs_manual_review = TRUE;

-- Table: twitter_match_candidates (temporary staging)
-- Stores potential matches before LLM verification
CREATE TABLE IF NOT EXISTS twitter_match_candidates (
    id SERIAL PRIMARY KEY,

    telegram_user_id BIGINT NOT NULL REFERENCES contacts(user_id),

    -- Twitter candidate info
    twitter_id VARCHAR(50) NOT NULL,
    twitter_username VARCHAR(100) NOT NULL,
    twitter_name VARCHAR(200),
    twitter_bio TEXT,
    twitter_location VARCHAR(200),
    twitter_verified BOOLEAN,
    twitter_blue_verified BOOLEAN,
    twitter_followers_count INTEGER,

    -- Candidate scoring
    candidate_rank INTEGER,  -- 1 = best match, 2 = second best, etc.
    match_method VARCHAR(50),
    baseline_confidence FLOAT,
    match_signals JSONB,  -- {'handle_match': true, 'fuzzy_score': 0.85, ...}

    -- LLM processing status
    needs_llm_review BOOLEAN DEFAULT TRUE,
    llm_processed BOOLEAN DEFAULT FALSE,
    llm_verdict VARCHAR(20),
    final_confidence FLOAT,

    created_at TIMESTAMP DEFAULT NOW(),

    -- Backs the ON CONFLICT (telegram_user_id, twitter_id) DO NOTHING used by
    -- the candidate-finder scripts; without it Postgres rejects that clause
    UNIQUE(telegram_user_id, twitter_id)
);

CREATE INDEX IF NOT EXISTS idx_tmc_telegram_user ON twitter_match_candidates(telegram_user_id);
CREATE INDEX IF NOT EXISTS idx_tmc_twitter_id ON twitter_match_candidates(twitter_id);
CREATE INDEX IF NOT EXISTS idx_tmc_needs_review ON twitter_match_candidates(llm_processed, needs_llm_review)
    WHERE needs_llm_review = TRUE AND llm_processed = FALSE;
CREATE INDEX IF NOT EXISTS idx_tmc_rank ON twitter_match_candidates(telegram_user_id, candidate_rank);

-- Grant permissions (adjust as needed for your setup)
-- GRANT ALL PRIVILEGES ON twitter_telegram_matches TO your_user;
-- GRANT ALL PRIVILEGES ON twitter_match_candidates TO your_user;
-- GRANT USAGE, SELECT ON SEQUENCE twitter_telegram_matches_id_seq TO your_user;
-- GRANT USAGE, SELECT ON SEQUENCE twitter_match_candidates_id_seq TO your_user;

COMMENT ON TABLE twitter_telegram_matches IS 'Confirmed matches between Telegram and Twitter profiles';
COMMENT ON TABLE twitter_match_candidates IS 'Temporary staging for potential matches awaiting LLM verification';
791
verify_twitter_matches_v2.py
Executable file
@@ -0,0 +1,791 @@
#!/usr/bin/env python3
"""
Twitter-Telegram Match Verifier V2 (Confidence-Based with Batch Evaluation)
Uses LLM with confidence scoring (0-1) and evaluates all candidates for a TG user together
"""

import sys
import asyncio
import json
from pathlib import Path
from typing import List, Dict
import psycopg2
from psycopg2.extras import DictCursor, RealDictCursor
import openai
from datetime import datetime
import re

# Add parent directory to path
sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))

from db_config import SessionLocal
from models import Contact

# Twitter database connection
TWITTER_DB_CONFIG = {
    'dbname': 'twitter_data',
    'user': 'andrewjiang',
    'host': 'localhost',
    'port': 5432
}

# Checkpoint file
CHECKPOINT_FILE = Path(__file__).parent.parent / 'llm_verification_v2_checkpoint.json'


class CheckpointManager:
    """Manage checkpoints for resumable verification"""

    def __init__(self, checkpoint_file):
        self.checkpoint_file = checkpoint_file
        self.data = self.load()

    def load(self):
        if self.checkpoint_file.exists():
            with open(self.checkpoint_file, 'r') as f:
                return json.load(f)
        return {
            'last_processed_telegram_id': None,
            'processed_count': 0,
            'total_matches_saved': 0,
            'total_cost': 0.0,
            'total_tokens': 0,
            'started_at': None
        }

    def save(self):
        self.data['last_updated_at'] = datetime.now().isoformat()
        with open(self.checkpoint_file, 'w') as f:
            json.dump(self.data, f, indent=2)

    def update(self, telegram_user_id, matches_saved, tokens_used, cost):
        self.data['last_processed_telegram_id'] = telegram_user_id
        self.data['processed_count'] += 1
        self.data['total_matches_saved'] += matches_saved
        self.data['total_tokens'] += tokens_used
        self.data['total_cost'] += cost

        # Save every 20 telegram users
        if self.data['processed_count'] % 20 == 0:
            self.save()


def calculate_username_specificity(username: str) -> float:
    """
    Calculate how specific/unique a username is (0-1)

    Generic usernames get lower scores, unique ones get higher scores
    """
    if not username:
        return 0.5

    username_lower = username.lower()

    # Very generic patterns
    generic_patterns = [
        r'^admin\d*$', r'^user\d+$', r'^crypto\d*$', r'^nft\d*$',
        r'^web3\d*$', r'^trader\d*$', r'^dev\d*$', r'^official\d*$',
        r'^team\d*$', r'^support\d*$', r'^info\d*$'
    ]

    for pattern in generic_patterns:
        if re.match(pattern, username_lower):
            return 0.3

    # Length-based scoring
    length = len(username)
    if length < 4:
        return 0.4
    elif length < 6:
        return 0.6
    elif length < 8:
        return 0.75
    else:
        return 0.9


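# Reference points implied by the rules above (worked examples):
#   calculate_username_specificity('admin42')   -> 0.3  (generic pattern)
#   calculate_username_specificity('bob')       -> 0.4  (shorter than 4 chars)
#   calculate_username_specificity('kupermind') -> 0.9  (8+ chars, not generic)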
class LLMVerifier:
    """Use GPT-5 Mini for confidence-based verification"""

    def __init__(self):
        self.client = openai.AsyncOpenAI()
        self.model = "gpt-5-mini"  # GPT-5 Mini - balanced performance and cost

        # Cost calculation for gpt-5-mini
        self.input_cost_per_1m = 0.25  # $0.25 per 1M input tokens
        self.output_cost_per_1m = 2.00  # $2.00 per 1M output tokens

    def build_batch_prompt(self, telegram_profile: Dict, candidates: List[Dict]) -> str:
        """Build LLM prompt for batch evaluation of all candidates"""

        # Format Telegram profile
        tg_bio = telegram_profile.get('bio', 'none')
        if tg_bio and len(tg_bio) > 300:
            tg_bio = tg_bio[:300] + '...'

        # Format chat context
        chat_context = telegram_profile.get('chat_context', {})
        chat_info = ""
        if chat_context.get('chat_titles'):
            chat_info = f"\nGroup Chats: {chat_context['chat_titles']}"
        if chat_context.get('is_crypto_focused'):
            chat_info += " [CRYPTO-FOCUSED GROUPS]"

        prompt = f"""TELEGRAM PROFILE:
Username: @{telegram_profile.get('username') or 'none'}
Display Name: {telegram_profile.get('name', 'none')} (first_name + last_name combined)
Bio: {tg_bio}{chat_info}

TWITTER CANDIDATES (evaluate all together):
"""

        for i, candidate in enumerate(candidates, 1):
            tw_bio = candidate.get('twitter_bio', 'none')
            if tw_bio and len(tw_bio) > 250:
                tw_bio = tw_bio[:250] + '...'

            # Check if phash match info exists
            phash_info = ""
            if candidate['match_method'] == 'phash_match':
                phash_distance = candidate.get('match_signals', {}).get('phash_distance', 'unknown')
                phash_info = f"\n⭐ Phash Match: distance={phash_distance} (identical profile pictures!)"

            prompt += f"""
[Candidate {i}]
Twitter Username: @{candidate['twitter_username']}
Twitter Display Name: {candidate.get('twitter_name', 'Unknown')}
Twitter Bio: {tw_bio}
Location: {candidate.get('twitter_location') or 'none'}
Verified: {candidate.get('twitter_verified', False)} (Blue: {candidate.get('twitter_blue_verified', False)})
Followers: {candidate.get('twitter_followers_count', 0):,}
Match Method: {candidate['match_method']}{phash_info}
Baseline Confidence: {candidate.get('baseline_confidence', 0):.2f}
"""

        return prompt

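    # Concurrency sketch (illustrative only; the file's actual driver is not
    # shown here): the semaphore passed to verify_batch bounds in-flight
    # OpenAI requests, matching the --concurrent option exposed by main.py.
    #   sem = asyncio.Semaphore(100)
    #   results = await asyncio.gather(*(
    #       verifier.verify_batch(profile, cands, sem) for profile, cands in work
    #   ))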
    async def verify_batch(self, telegram_profile: Dict, candidates: List[Dict], semaphore, log_file=None) -> Dict:
        """Verify all candidates for a single Telegram user in one LLM call"""

        async with semaphore:
            prompt = self.build_batch_prompt(telegram_profile, candidates)

            if log_file:
                log_file.write(f"\n{'=' * 100}\n")
                log_file.write(f"TELEGRAM USER: {telegram_profile.get('username', 'N/A')} (ID: {telegram_profile['user_id']})\n")
                log_file.write(f"{'=' * 100}\n\n")

            system_prompt = """You are an expert at determining if two social media profiles belong to the same person.

# TASK
Determine confidence (0.0-1.0) that each Twitter candidate is the same person as the Telegram profile.

**CRITICAL: Evaluate ALL candidates together, not in isolation. Compare them against each other to identify which has the STRONGEST evidence.**

# SIGNAL STRENGTH GUIDE

Evaluate the FULL CONTEXT of all available signals together. Individual signals can be strong or weak, but the overall picture matters most.

## VERY STRONG SIGNALS (Can individually suggest high confidence)
- **Explicit bio mention**: TG bio says "x.com/username" or "Follow me @username"
  - ⚠️ EXCEPTION: If that account is clearly a company/project (not personal), this is NOT definitive
  - Example: TG bio "x.com/gems_gun" but @gems_gun is a company account → look for a personal account like @lucahl0 "Building @gems_gun"
- **Unique username exact match**: Unusual/long username (like @kupermind, @schellinger_k) that matches exactly
  - Generic usernames (@mike, @crypto123) don't qualify as "unique"

## STRONG SUPPORTING SIGNALS (Good indicators when combined)
Each of these helps build confidence, especially when multiple align:
- Full name match (after normalization: remove .eth, emojis, separators)
- Same profile picture (phash match)
- Aligned bio themes/context (both in crypto, both mention the same projects/interests)
- Very similar username (not exact, but close: @kevin vs @k_kevin)

## WEAK SIGNALS (Need multiple strong signals to be meaningful)
- Generic name match only (Alex, Mike, David, John, Baz)
- Same general field but no specifics
- Partial username similarity with a generic name

## RED FLAGS (Lower confidence significantly)
- Context mismatch: TG is crypto/tech, TW is chef/athlete/journalist
- Company account when looking for a personal profile
- Famous person/celebrity (unless clear evidence it's actually them)

# CONFIDENCE BANDS

## 0.90-1.0: NEARLY CERTAIN
Very strong signal (bio mention of a personal account OR unique username match) + supporting signals align
Examples:
- TG bio: "https://x.com/kupermind" + TW @kupermind personal account → 0.97
- TG @olliten + TW @olliten + same name + same pic → 0.98

## 0.70-0.89: LIKELY
Multiple strong supporting signals converge, or one very strong signal with some gap
Examples:
- Very similar username + name match + context: @schellinger → @k_schellinger "Kevin Schellinger" → 0.85
- Exact username on a moderately unique name: @alexcrypto + "Alex Smith" in crypto → 0.78

## 0.40-0.69: POSSIBLE
Some evidence but significant uncertainty
- Generic name + same field but no username/pic match
- Weak username similarity with a generic name
- Profile pic match but the name is very generic

## 0.10-0.39: UNLIKELY
Minimal evidence or contradictions
- Only a generic name match (David, Alex, Mike)
- Context mismatch (crypto person vs chef)

## 0.0-0.09: EXTREMELY UNLIKELY
No meaningful evidence, or a clear contradiction

# COMPARATIVE EVALUATION PROCESS

**Step 1: Review ALL candidates together**
Don't score each in isolation. Look at the full set to understand which has the strongest evidence.

**Step 2: Identify the strongest signals present**
- Is there a bio mention? (Check whether it points to a personal or a company account!)
- Is there a unique username match?
- Do multiple supporting signals converge on one candidate?

**Step 3: Apply differential scoring**
- The candidate with the STRONGEST evidence should get a meaningfully higher score
- If Candidate A has unique username + name match, and Candidate B only has a generic name → A gets 0.85+, B gets 0.40 max
- If ALL candidates have only weak signals (generic name only) → ALL score 0.20-0.40

**Step 4: Sanity checks**
- Could this evidence match thousands of people? → Lower confidence
- Is there a context mismatch? → Max 0.50
- Is this a company account when we need a personal one? → Not the right match

**Key principle: Only ONE candidate can be "most likely" - differentiate clearly between them.**

# TECHNICAL NOTES

**Name Normalization**: Before comparing, remove .eth/.ton/.sol suffixes, emojis, and "| company" separators, and ignore capitalization

**Profile Picture (phash)**: A phash match alone → MAX 0.70 (supporting signal). Use it to break ties or add confidence to other signals.

# OUTPUT FORMAT
Return ONLY valid JSON (no markdown, no explanation):
{
  "candidates": [
    {
      "candidate_index": 1,
      "confidence": 0.85,
      "reasoning": "Brief explanation"
    },
    ...
  ]
}"""

            try:
                if log_file:
                    log_file.write("SYSTEM PROMPT:\n")
                    log_file.write("-" * 100 + "\n")
                    log_file.write(system_prompt + "\n\n")
                    log_file.write("USER PROMPT:\n")
                    log_file.write("-" * 100 + "\n")
                    log_file.write(prompt + "\n\n")
                    log_file.flush()

                response = await self.client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": prompt}
                    ],
                    response_format={"type": "json_object"}
                )

                content = response.choices[0].message.content.strip()

                if log_file:
                    log_file.write("LLM RESPONSE:\n")
                    log_file.write("-" * 100 + "\n")
                    log_file.write(content + "\n\n")
                    log_file.flush()

                # Parse JSON
                try:
                    result = json.loads(content)
                except json.JSONDecodeError:
                    print("  ⚠️ Failed to parse JSON response")
                    return {
                        'success': False,
                        'error': 'json_parse_error',
                        'tokens_used': response.usage.total_tokens,
                        'cost': self.calculate_cost(response.usage)
                    }

                tokens_used = response.usage.total_tokens
                cost = self.calculate_cost(response.usage)

                return {
                    'success': True,
                    'results': result.get('candidates', []),
                    'tokens_used': tokens_used,
                    'cost': cost,
                    'error': None
                }

            except Exception as e:
                print(f"  ⚠️ LLM error: {str(e)[:100]}")
                return {
                    'success': False,
                    'error': str(e),
                    'tokens_used': 0,
                    'cost': 0.0
                }

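    # Shape of the parsed LLM response this method expects (values illustrative,
    # per the OUTPUT FORMAT section of the system prompt above):
    # {
    #     "candidates": [
    #         {"candidate_index": 1, "confidence": 0.85, "reasoning": "Unique username match"},
    #         {"candidate_index": 2, "confidence": 0.20, "reasoning": "Generic name only"}
    #     ]
    # }
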
    def calculate_cost(self, usage) -> float:
        """Calculate cost for this API call"""
        input_cost = (usage.prompt_tokens / 1_000_000) * self.input_cost_per_1m
        output_cost = (usage.completion_tokens / 1_000_000) * self.output_cost_per_1m
        return input_cost + output_cost


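# Worked example for calculate_cost (rates hypothetical; the real per-1M-token
# prices are whatever __init__ sets on input_cost_per_1m / output_cost_per_1m):
#   prompt_tokens = 2_000 at $0.25/1M     -> (2_000 / 1_000_000) * 0.25 = $0.0005
#   completion_tokens = 300 at $2.00/1M   -> (300 / 1_000_000) * 2.00   = $0.0006
#   total cost for the call               -> $0.0011
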
def get_telegram_users_with_candidates(telegram_conn, checkpoint_manager, limit=None):
    """Get list of telegram_user_ids that have unprocessed candidates"""
    with telegram_conn.cursor() as cur:
        query = """
            SELECT DISTINCT telegram_user_id
            FROM twitter_match_candidates
            WHERE needs_llm_review = TRUE
              AND llm_processed = FALSE
        """

        if checkpoint_manager.data['last_processed_telegram_id']:
            query += f" AND telegram_user_id > {checkpoint_manager.data['last_processed_telegram_id']}"

        query += " ORDER BY telegram_user_id"

        if limit:
            query += f" LIMIT {limit}"

        cur.execute(query)
        return [row[0] for row in cur.fetchall()]


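# Note on the query construction above: last_processed_telegram_id and limit
# are interpolated directly into the SQL string. Both come from internal state
# (the checkpoint file and the --test flag), not user input, so this is safe
# here, but a parameterized form would be the more defensive choice, e.g.:
#   cur.execute(query + " AND telegram_user_id > %s ORDER BY telegram_user_id", (last_id,))
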
def get_candidates_for_telegram_user(telegram_user_id: int, telegram_conn):
    """Get all unprocessed candidates for a specific Telegram user"""
    with telegram_conn.cursor(cursor_factory=RealDictCursor) as cur:
        cur.execute("""
            SELECT *
            FROM twitter_match_candidates
            WHERE telegram_user_id = %s
              AND needs_llm_review = TRUE
              AND llm_processed = FALSE
            ORDER BY baseline_confidence DESC
        """, (telegram_user_id,))
        return [dict(row) for row in cur.fetchall()]


def get_user_chat_context(user_id: int, telegram_conn) -> Dict:
    """Get chat participation context for a user"""
    with telegram_conn.cursor(cursor_factory=RealDictCursor) as cur:
        cur.execute("""
            SELECT
                STRING_AGG(DISTINCT c.title, ' | ') FILTER (WHERE c.title IS NOT NULL) as chat_titles,
                COUNT(DISTINCT cp.chat_id) as chat_count
            FROM chat_participants cp
            JOIN chats c ON cp.chat_id = c.chat_id
            WHERE cp.user_id = %s
              AND c.title IS NOT NULL
              AND c.chat_type != 'private'
        """, (user_id,))
        result = cur.fetchone()

    if result and result['chat_titles']:
        # Check if chats indicate crypto/web3 interest
        chat_titles_lower = result['chat_titles'].lower()
        crypto_keywords = ['crypto', 'bitcoin', 'eth', 'defi', 'nft', 'dao', 'web3', 'blockchain',
                           'solana', 'near', 'avalanche', 'polygon', 'base', 'arbitrum', 'optimism',
                           'cosmos', 'builders', 'degen', 'lobster']
        is_crypto_focused = any(keyword in chat_titles_lower for keyword in crypto_keywords)

        return {
            'chat_titles': result['chat_titles'],
            'chat_count': result['chat_count'],
            'is_crypto_focused': is_crypto_focused
        }

    return {'chat_titles': None, 'chat_count': 0, 'is_crypto_focused': False}


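# Illustrative return values (shape as built above):
#   {'chat_titles': 'DeFi Builders | NFT Degen Chat', 'chat_count': 2, 'is_crypto_focused': True}
#   {'chat_titles': None, 'chat_count': 0, 'is_crypto_focused': False}   # no group chats found
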
async def verify_telegram_user(telegram_user_id: int, verifier: LLMVerifier, checkpoint_manager: CheckpointManager,
                               telegram_db, telegram_conn, log_file=None) -> Dict:
    """Verify all candidates for a single Telegram user"""

    # Get Telegram profile
    telegram_profile = telegram_db.query(Contact).filter(
        Contact.user_id == telegram_user_id
    ).first()

    if not telegram_profile:
        return {'matches_saved': 0, 'tokens': 0, 'cost': 0}

    # Construct display name from first_name + last_name
    display_name = (telegram_profile.first_name or '') + (' ' + telegram_profile.last_name if telegram_profile.last_name else '')
    display_name = display_name.strip() or None

    # Get chat participation context
    chat_context = get_user_chat_context(telegram_user_id, telegram_conn)

    telegram_dict = {
        'user_id': telegram_profile.user_id,
        'account_id': telegram_profile.account_id,
        'username': telegram_profile.username,
        'name': display_name,
        'bio': telegram_profile.bio,
        'chat_context': chat_context
    }

    # Get all candidates
    candidates = get_candidates_for_telegram_user(telegram_user_id, telegram_conn)

    if not candidates:
        return {'matches_saved': 0, 'tokens': 0, 'cost': 0}

    # Verify with LLM. This local semaphore only serializes work within this
    # call; cross-user concurrency is limited by the semaphore in main().
    semaphore = asyncio.Semaphore(1)
    llm_result = await verifier.verify_batch(telegram_dict, candidates, semaphore, log_file)

    if not llm_result['success']:
        # Mark as processed even on error so we don't retry infinitely
        mark_candidates_processed([c['id'] for c in candidates], telegram_conn)
        return {
            'matches_saved': 0,
            'tokens': llm_result['tokens_used'],
            'cost': llm_result['cost']
        }

    # Save matches
    matches_saved = save_verified_matches(
        telegram_dict,
        candidates,
        llm_result['results'],
        telegram_conn
    )

    # Mark candidates as processed
    mark_candidates_processed([c['id'] for c in candidates], telegram_conn)

    return {
        'matches_saved': matches_saved,
        'tokens': llm_result['tokens_used'],
        'cost': llm_result['cost']
    }


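# Per-user pipeline implemented above: load profile -> attach chat context ->
# load candidates -> one LLM call scoring all candidates together -> persist
# rows with confidence >= 0.5 -> mark candidates processed (even on LLM
# failure, so a bad response never causes an infinite retry loop).
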
def save_verified_matches(telegram_profile: Dict, candidates: List[Dict], llm_results: List[Dict], telegram_conn):
    """Save verified matches with confidence scores"""

    matches_to_save = []

    # CRITICAL: Post-process to fix generic name false positives.
    # Use `or ''` rather than a .get() default: these keys exist but may hold None.
    tg_name = telegram_profile.get('name') or ''
    tg_username = telegram_profile.get('username') or ''
    tg_bio = telegram_profile.get('bio') or ''

    # Check if profile has generic characteristics
    has_generic_name = len(tg_name) <= 7          # Short display name
    has_generic_username = len(tg_username) <= 8  # Short username
    has_empty_bio = not tg_bio or len(tg_bio) <= 20

    is_generic_profile = has_empty_bio and (has_generic_name or has_generic_username)

    for llm_result in llm_results:
        candidate_idx = llm_result.get('candidate_index', 0) - 1  # Convert to 0-indexed

        if candidate_idx < 0 or candidate_idx >= len(candidates):
            continue

        confidence = llm_result.get('confidence', 0)
        reasoning = llm_result.get('reasoning', '')

        # Only save if confidence >= 0.5 (moderate or higher)
        if confidence < 0.5:
            continue

        candidate = candidates[candidate_idx]

        # CRITICAL FIX: Cap confidence for generic profiles with weak match methods.
        # Weak match methods are those based purely on name/username containment.
        weak_match_methods = [
            'display_name_containment',
            'fuzzy_name',
            'tg_username_in_twitter_name',
            'twitter_username_in_tg_name'
        ]

        if is_generic_profile and candidate['match_method'] in weak_match_methods:
            # Cap at 0.70 unless it's a strong signal (phash, exact_username, exact_bio_handle)
            if confidence > 0.70:
                confidence = 0.70
                reasoning += " [Confidence capped at 0.70: generic profile + weak match method]"

        match_details = {
            'match_method': candidate['match_method'],
            'baseline_confidence': candidate['baseline_confidence'],
            'llm_confidence': confidence,
            'llm_reasoning': reasoning
        }

        matches_to_save.append((
            telegram_profile['account_id'],
            telegram_profile['user_id'],
            telegram_profile.get('username'),
            telegram_profile.get('name'),
            telegram_profile.get('bio'),
            candidate['twitter_id'],
            candidate['twitter_username'],
            candidate.get('twitter_name'),
            candidate.get('twitter_bio'),
            candidate.get('twitter_location'),
            candidate.get('twitter_verified', False),
            candidate.get('twitter_blue_verified', False),
            candidate.get('twitter_followers_count', 0),
            candidate['match_method'],
            candidate['baseline_confidence'],
            'CONFIDENT' if confidence >= 0.8 else 'MODERATE' if confidence >= 0.6 else 'UNSURE',
            confidence,
            json.dumps(match_details),
            llm_result.get('tokens_used', 0),
            0,  # cost is aggregated at the batch level
            confidence < 0.75  # needs_manual_review
        ))

    if matches_to_save:
        with telegram_conn.cursor() as cur:
            cur.executemany("""
                INSERT INTO twitter_telegram_matches (
                    account_id, telegram_user_id, telegram_username, telegram_name, telegram_bio,
                    twitter_id, twitter_username, twitter_name, twitter_bio, twitter_location,
                    twitter_verified, twitter_blue_verified, twitter_followers_count,
                    match_method, baseline_confidence, llm_verdict, final_confidence,
                    match_details, llm_tokens_used, llm_cost, needs_manual_review
                ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                ON CONFLICT (telegram_user_id, twitter_id) DO UPDATE SET
                    llm_verdict = EXCLUDED.llm_verdict,
                    final_confidence = EXCLUDED.final_confidence,
                    matched_at = NOW()
            """, matches_to_save)
        telegram_conn.commit()

    return len(matches_to_save)


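# Verdict banding used above (derived from the final LLM confidence):
#   confidence >= 0.80 -> 'CONFIDENT'
#   confidence >= 0.60 -> 'MODERATE'
#   confidence >= 0.50 -> 'UNSURE'   (anything below 0.50 is never saved)
# Rows with confidence < 0.75 are additionally flagged needs_manual_review.
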
def mark_candidates_processed(candidate_ids: List[int], telegram_conn):
    """Mark candidates as LLM processed"""
    with telegram_conn.cursor() as cur:
        cur.execute("""
            UPDATE twitter_match_candidates
            SET llm_processed = TRUE
            WHERE id = ANY(%s)
        """, (candidate_ids,))
    telegram_conn.commit()


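# psycopg2 adapts a Python list to a Postgres array, so the "id = ANY(%s)"
# predicate above marks the whole batch in a single UPDATE statement.
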
async def main():
    print()
    print("=" * 70)
    print("🤖 Twitter-Telegram Match Verifier V2 (Confidence-Based)")
    print("=" * 70)
    print()

    # Check arguments
    test_mode = '--test' in sys.argv
    verbose = '--verbose' in sys.argv
    limit = 100 if test_mode else None

    # Check for concurrency argument
    concurrent_requests = 10  # Default
    for i, arg in enumerate(sys.argv):
        if arg == '--concurrent' and i + 1 < len(sys.argv):
            try:
                concurrent_requests = int(sys.argv[i + 1])
            except ValueError:
                pass

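    # Typical invocations (assuming the script is run directly):
    #   python verify_twitter_matches_v2.py                  # full run, resumes from checkpoint
    #   python verify_twitter_matches_v2.py --test           # first 100 Telegram users only
    #   python verify_twitter_matches_v2.py --verbose        # log prompts/responses to file
    #   python verify_twitter_matches_v2.py --concurrent 20  # raise the parallel request limit
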
    # Setup verbose logging
    log_file = None
    if verbose:
        log_file = open(Path(__file__).parent.parent / 'verification_v2_log.txt', 'w')
        log_file.write("=" * 100 + "\n")
        log_file.write("VERIFICATION V2 LOG\n")
        log_file.write("=" * 100 + "\n\n")

    if test_mode:
        print("🧪 TEST MODE: Processing first 100 Telegram users only")
        print()

    # Initialize checkpoint
    checkpoint_manager = CheckpointManager(CHECKPOINT_FILE)

    if checkpoint_manager.data['last_processed_telegram_id']:
        print(f"📍 Resuming from telegram_user_id {checkpoint_manager.data['last_processed_telegram_id']}")
        print(f"   Already processed: {checkpoint_manager.data['processed_count']:,}")
        print(f"   Cost so far: ${checkpoint_manager.data['total_cost']:.4f}")
        print()
    else:
        checkpoint_manager.data['started_at'] = datetime.now().isoformat()

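    # Illustrative checkpoint contents (field names as used in this script; the
    # exact schema lives in CheckpointManager):
    # {
    #     "last_processed_telegram_id": 123456789,
    #     "processed_count": 1500,
    #     "total_matches_saved": 420,
    #     "total_tokens": 2100000,
    #     "total_cost": 4.87,
    #     "started_at": "2025-11-04T12:00:00"
    # }
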
    # Connect to databases
    print("📡 Connecting to databases...")
    telegram_db = SessionLocal()

    try:
        telegram_conn = psycopg2.connect(dbname='telegram_contacts', user='andrewjiang', host='localhost', port=5432)
        telegram_conn.autocommit = False
    except Exception as e:
        print(f"❌ Failed to connect to Telegram database: {e}")
        return False

    print("✅ Connected")
    print()

    completed = False  # Set on a clean finish so `finally` doesn't resurrect the checkpoint
    try:
        # Load telegram users needing verification
        print("🔍 Loading telegram users with candidates...")
        telegram_user_ids = get_telegram_users_with_candidates(telegram_conn, checkpoint_manager, limit)

        if not telegram_user_ids:
            print("✅ No users to verify!")
            return True

        print(f"✅ Found {len(telegram_user_ids):,} telegram users to verify")
        print()

        # Estimate cost (rough): ~$0.003 per user, e.g. 10,000 users -> ~$30
        estimated_cost = len(telegram_user_ids) * 0.003
        print(f"💰 Estimated cost: ${estimated_cost:.4f}")
        print()

        # Initialize verifier
        verifier = LLMVerifier()

        print("🚀 Starting LLM verification...")
        print(f"⚡ Concurrent requests: {concurrent_requests}")
        print()

        # Configuration for parallel processing
        CONCURRENT_REQUESTS = concurrent_requests  # Process N users at a time
        BATCH_SIZE = 50  # Save checkpoint every 50 users

        # Process users in parallel batches
        total_users = len(telegram_user_ids)
        processed_count = 0

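        # Two knobs control throughput here: BATCH_SIZE is the checkpoint
        # granularity (how many users are dispatched before progress is
        # persisted and reported), while CONCURRENT_REQUESTS caps how many of
        # those users are in flight against the LLM API at any moment.
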
        for batch_start in range(0, total_users, BATCH_SIZE):
            batch_end = min(batch_start + BATCH_SIZE, total_users)
            batch = telegram_user_ids[batch_start:batch_end]

            print(f"📦 Processing batch {batch_start//BATCH_SIZE + 1}/{(total_users + BATCH_SIZE - 1)//BATCH_SIZE} ({len(batch)} users)...")

            # Create tasks for concurrent processing
            tasks = []
            for telegram_user_id in batch:
                task = verify_telegram_user(
                    telegram_user_id,
                    verifier,
                    checkpoint_manager,
                    telegram_db,
                    telegram_conn,
                    log_file
                )
                tasks.append((telegram_user_id, task))

            # Process batch concurrently with limit
            semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)

            async def process_with_semaphore(user_id, task):
                async with semaphore:
                    return user_id, await task

            results = await asyncio.gather(
                *[process_with_semaphore(user_id, task) for user_id, task in tasks],
                return_exceptions=True
            )

            # Process results and update checkpoints. gather() preserves input
            # order, so results[i] corresponds to batch[i].
            for i, result in enumerate(results):
                processed_count += 1
                user_id = batch[i]

                if isinstance(result, Exception):
                    print(f"[{processed_count}/{total_users}] ❌ User {user_id} failed: {result}")
                    continue

                user_id_result, verification_result = result
                print(f"[{processed_count}/{total_users}] ✅ User {user_id_result}: {verification_result['matches_saved']} matches | ${verification_result['cost']:.4f}")

                # Update checkpoint
                checkpoint_manager.update(
                    user_id_result,
                    verification_result['matches_saved'],
                    verification_result['tokens'],
                    verification_result['cost']
                )

            print(f"   Batch complete. Total processed: {processed_count}/{total_users}")
            print()

        # Final stats
        print()
        print("=" * 70)
        print("✅ VERIFICATION COMPLETE")
        print("=" * 70)
        print()
        print("📊 Statistics:")
        print(f"   Processed: {checkpoint_manager.data['processed_count']:,} telegram users")
        print(f"   💾 Saved matches: {checkpoint_manager.data['total_matches_saved']:,}")
        print()
        print("💰 Cost:")
        print(f"   Total tokens: {checkpoint_manager.data['total_tokens']:,}")
        print(f"   Total cost: ${checkpoint_manager.data['total_cost']:.4f}")
        print()

        # Clean up checkpoint
        CHECKPOINT_FILE.unlink(missing_ok=True)
        completed = True

        return True

    except Exception as e:
        print(f"❌ Error: {e}")
        import traceback
        traceback.print_exc()
        return False

    finally:
        if log_file:
            log_file.close()
        telegram_db.close()
        telegram_conn.close()
        if not completed:
            # Persist progress only for interrupted/failed runs; a clean finish
            # has already deleted the checkpoint and must not recreate it.
            checkpoint_manager.save()


if __name__ == "__main__":
    try:
        success = asyncio.run(main())
        sys.exit(0 if success else 1)
    except KeyboardInterrupt:
        print("\n\n⚠️ Interrupted by user")
        print("💾 Progress saved - you can resume by running this script again")
        sys.exit(1)