mirror of
https://github.com/lockin-bot/ProfileMatching.git
synced 2026-01-12 18:03:22 +08:00
This module provides comprehensive Twitter-to-Telegram profile matching and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
219 lines
5.6 KiB
Markdown
# ProfileMatching Quick Start Guide

## 🚀 Fastest Way to Get Started

### Option 1: Interactive Menu (RECOMMENDED)
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```

This gives you an interactive menu with:
- Real-time statistics
- Guided workflow
- Easy access to all features

### Option 2: Launch from Main UnifiedContacts Menu
```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts
python3.10 main.py
# Select option 15: "Open Profile Matching System"
```

## 📋 Typical Workflow

### Step 1: Find Candidates (First Time)
If you haven't found candidates yet, run:

```bash
# From ProfileMatching folder
python3.10 find_twitter_candidates_threaded.py --workers 8

# Or use interactive menu: Option 1
```

**Expected time**: ~16-18 hours for all 43K contacts

### Step 2: Verify with LLM
After candidates are found, verify them:

```bash
# From ProfileMatching folder
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100

# Or use interactive menu: Option 3
```

**Expected time**: ~23 hours for all 43K users

**Cost**: ~$130 for all 43K users (GPT-5-mini at ~$0.003/user)

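The time and cost figures above follow from simple arithmetic. The sketch below reproduces them, assuming the ~$0.003/user price and the 43K users in ~23 hours throughput quoted in this guide. The function name `estimate_verification` and its parameters are illustrative and are not part of the project.

```python
def estimate_verification(
    num_users: int,
    cost_per_user: float = 0.003,          # assumed from the figure quoted in this guide
    users_per_hour: float = 43_000 / 23,   # assumed from "~23 hours for 43K users"
) -> tuple[float, float]:
    """Return (estimated_cost_usd, estimated_hours) for a verification run."""
    cost = num_users * cost_per_user
    hours = num_users / users_per_hour
    return cost, hours


# Full run over all contacts
full_cost, full_hours = estimate_verification(43_000)
print(f"Full run: ~${full_cost:.0f}, ~{full_hours:.0f} hours")

# Small test batch
test_cost, test_hours = estimate_verification(50)
print(f"Test run: ~${test_cost:.2f}, ~{test_hours * 60:.1f} minutes")
```

Actual throughput depends on the `--concurrent` setting and on API rate limits, so treat the time estimate as a rough guide only.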
### Step 3: Review Results
```bash
# From ProfileMatching folder
python3.10 review_match_quality.py

# Or use interactive menu: Option 5
```

## 🧪 Test Mode (Recommended Before Full Run)

Always test with a small batch first:

```bash
# Test with 50 users
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10

# Or use interactive menu: Option 4
```

This helps you:
- Verify the system is working correctly
- Check match quality before paying for a full run
- Estimate costs and timing

## 📊 Check Current Status

You can check your progress at any time:

```bash
# Launch interactive menu and select Option 6: "Show statistics only"
python3.10 main.py
# Press 6, then 0 to exit
```

Or query the database directly:

```bash
psql -d telegram_contacts -U andrewjiang -c "
SELECT
    COUNT(DISTINCT telegram_user_id) as users_with_candidates,
    COUNT(*) as total_candidates,
    COUNT(*) FILTER (WHERE llm_processed = TRUE) as processed,
    COUNT(*) FILTER (WHERE llm_processed = FALSE) as pending
FROM twitter_match_candidates;
"
```

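To see what the status query computes, here is a minimal sketch that runs an equivalent aggregation against an in-memory SQLite table. SQLite is only a stand-in for the real PostgreSQL database. The table is reduced to the two columns named in the query above, the sample rows are made up, and `SUM(CASE ...)` replaces `FILTER` for portability.

```python
import sqlite3

# In-memory stand-in for the twitter_match_candidates table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE twitter_match_candidates ("
    " telegram_user_id INTEGER,"
    " llm_processed INTEGER)"  # 1 = TRUE, 0 = FALSE
)

# Made-up sample data: user 1 has two processed candidates,
# user 2 has one pending candidate, user 3 has two pending candidates.
sample_rows = [(1, 1), (1, 1), (2, 0), (3, 0), (3, 0)]
conn.executemany("INSERT INTO twitter_match_candidates VALUES (?, ?)", sample_rows)

users, total, processed, pending = conn.execute(
    """
    SELECT
        COUNT(DISTINCT telegram_user_id),
        COUNT(*),
        SUM(CASE WHEN llm_processed = 1 THEN 1 ELSE 0 END),
        SUM(CASE WHEN llm_processed = 0 THEN 1 ELSE 0 END)
    FROM twitter_match_candidates
    """
).fetchone()

print(f"users_with_candidates={users} total_candidates={total} "
      f"processed={processed} pending={pending}")
conn.close()
```

Note that `processed` and `pending` count candidate rows, not users, so they sum to `total_candidates`.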
## 🔄 Re-Running After Updates

If you've updated the LLM prompt or matching logic:

### Re-find Candidates (if matching logic changed)
```bash
# Delete old candidates
# (CASCADE also empties any tables that reference this one via foreign keys)
psql -d telegram_contacts -U andrewjiang -c "TRUNCATE twitter_match_candidates CASCADE;"

# Re-run candidate finding
python3.10 find_twitter_candidates_threaded.py --workers 8
```

### Re-verify with New Prompt (if only prompt changed)
```bash
# Reset LLM processing flag
psql -d telegram_contacts -U andrewjiang -c "UPDATE twitter_match_candidates SET llm_processed = FALSE;"

# Delete old matches
psql -d telegram_contacts -U andrewjiang -c "TRUNCATE twitter_telegram_matches;"

# Re-run verification
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```

## 🎯 Most Common Commands

### Find candidates for first 1000 contacts (testing)
```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```

### Verify matches for pending candidates
```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```

### Check match quality distribution
```bash
python3.10 review_match_quality.py
```

### Export matches to CSV (coming soon)
```bash
# Will be added in future update
```

## 💡 Pro Tips

1. **Always use threaded candidate finding** - It's 10-20x faster
2. **Use high concurrency for verification** - 100-200 concurrent requests for optimal speed
3. **Test first** - Always run with `--test --limit 50` before full runs
4. **Monitor costs** - Check OpenAI dashboard during verification
5. **Check the stats** - Use Option 6 in interactive menu to monitor progress

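To illustrate why the concurrency setting matters, the sketch below shows one common way to cap the number of in-flight requests with `asyncio.Semaphore`. This is not taken from `verify_twitter_matches_v2.py`. How the script actually implements `--concurrent` is not shown in this guide, and `verify_user` here is a hypothetical placeholder for a single LLM call.

```python
import asyncio


async def verify_user(user_id: int) -> bool:
    """Hypothetical placeholder for one LLM verification request."""
    await asyncio.sleep(0.01)  # stands in for network latency
    return user_id % 2 == 0    # dummy verdict, for demonstration only


async def verify_all(user_ids: list[int], concurrent: int = 100) -> list[bool]:
    """Verify all users while keeping at most `concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(concurrent)

    async def bounded(user_id: int) -> bool:
        async with semaphore:  # waits here once `concurrent` calls are active
            return await verify_user(user_id)

    # gather() preserves input order in its result list
    return await asyncio.gather(*(bounded(u) for u in user_ids))


results = asyncio.run(verify_all(list(range(20)), concurrent=5))
print(f"{sum(results)} of {len(results)} dummy users verified")
```

A higher limit shortens wall-clock time only until the API's rate limit becomes the bottleneck, which is why the tips above suggest watching the OpenAI dashboard.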
## 🐛 Troubleshooting

### "No candidates found"
- Check if Twitter database has data: `psql -d twitter_data -c "SELECT COUNT(*) FROM users;"`
- Check Telegram contacts: `psql -d telegram_contacts -c "SELECT COUNT(*) FROM contacts;"`

### "LLM verification is slow"
- Increase `--concurrent` parameter (try 150-200)
- Check OpenAI rate limits in dashboard
- Verify network connection

### "Too many low-quality matches"
- Review the V6 prompt in `verify_twitter_matches_v2.py`
- Run `review_match_quality.py` to analyze
- Consider adjusting confidence thresholds

### "Missing obvious matches"
- Check whether a candidate was found:
  ```sql
  SELECT * FROM twitter_match_candidates WHERE telegram_user_id = YOUR_USER_ID;
  ```
- If found but not verified, check the `llm_verdict` field for reasoning
- If not found at all, a new matching method may be needed

## 📚 More Information

- See `README.md` for complete documentation
- See `CHANGELOG.md` for recent updates
- See individual script files for command-line options

## 🆘 Need Help?

Common issues and solutions:

| Issue | Solution |
|-------|----------|
| Import errors | Make sure you're using python3.10 |
| Database connection errors | Check PostgreSQL is running: `pg_isready` |
| OpenAI API errors | Verify API key in `.env` file |
| Out of memory | Reduce concurrent requests or use batching |

## 🎓 Understanding the Output

### Candidate Finding Output
```
Processing contact 1000/43000 (2.3%)
Found 6 candidates for @username
  • exact_username: @username (0.90)
  • fuzzy_name: @similar_name (0.75)
```

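Each candidate line pairs a matching method with a score. As a rough sketch of how two such methods could produce those lines, the example below uses the standard library's `difflib` for fuzzy name matching. The project's actual matching logic lives in `find_twitter_candidates.py` and is not shown here. The fixed 0.90 score for exact usernames is copied from the sample output, and the function name, the 0.75 threshold, and the sample profiles are assumptions.

```python
from difflib import SequenceMatcher


def find_candidates(telegram_username, telegram_name, twitter_profiles,
                    fuzzy_threshold=0.75):
    """Return (method, twitter_handle, score) tuples, highest score first.

    twitter_profiles is a list of (handle, display_name) pairs.
    """
    candidates = []
    for handle, display_name in twitter_profiles:
        # Method 1: case-insensitive exact username match (score assumed fixed)
        if handle.lower() == telegram_username.lower():
            candidates.append(("exact_username", handle, 0.90))
            continue
        # Method 2: fuzzy display-name similarity via difflib (a stand-in technique)
        similarity = SequenceMatcher(
            None, telegram_name.lower(), display_name.lower()
        ).ratio()
        if similarity >= fuzzy_threshold:
            candidates.append(("fuzzy_name", handle, round(similarity, 2)))
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Made-up sample data
profiles = [
    ("username", "Some User"),
    ("similar_name", "Sam User"),
    ("unrelated", "Another Person"),
]
candidates_found = find_candidates("username", "Some User", profiles)
for method, handle, score in candidates_found:
    print(f"  • {method}: @{handle} ({score:.2f})")
```

Candidates like these are only proposals. The LLM verification step decides which of them become matches.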
### LLM Verification Output
```
[Progress] 500/1000 users (50.0%) | 125 matches | ~$1.50 | 25.0 users/min
```

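The progress line carries enough information to estimate the remaining time. Here is a small sketch that parses it, assuming the format is exactly as in the single sample line above (the real script's output may vary). `parse_progress` and the `eta_minutes` field are illustrative names, not part of the project.

```python
import re

# Pattern derived from the one sample line in this guide (an assumption).
PROGRESS_RE = re.compile(
    r"\[Progress\] (\d+)/(\d+) users \(([\d.]+)%\) \| (\d+) matches"
    r" \| ~\$([\d.]+) \| ([\d.]+) users/min"
)


def parse_progress(line: str) -> dict:
    """Extract counts, cost, and rate from a progress line and add an ETA."""
    match = PROGRESS_RE.search(line)
    if match is None:
        raise ValueError(f"Unrecognized progress line: {line!r}")
    done, total, _percent, matches, cost, rate = match.groups()
    done, total, rate = int(done), int(total), float(rate)
    return {
        "done": done,
        "total": total,
        "matches": int(matches),
        "cost_usd": float(cost),
        "users_per_min": rate,
        # Remaining users divided by the current rate
        "eta_minutes": (total - done) / rate if rate > 0 else None,
    }


info = parse_progress(
    "[Progress] 500/1000 users (50.0%) | 125 matches | ~$1.50 | 25.0 users/min"
)
print(f"{info['total'] - info['done']} users left, "
      f"about {info['eta_minutes']:.0f} minutes remaining")
```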
### Match Quality Review
```
Total users with matches: 25,662
Total matches: 36,147
Average confidence: 0.74

Confidence Distribution:
  90%+:   12,031 matches (HIGH)
  80-89%: 1,452 matches (MEDIUM)
  70-79%: 7,505 matches (LOW)
```
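
The confidence bands in the review output can be expressed as a simple bucketing function. This is a sketch based only on the labels shown above. Whether `review_match_quality.py` treats the boundaries as inclusive is an assumption, and the `BELOW_THRESHOLD` label for scores under 70% is hypothetical.

```python
from collections import Counter


def confidence_bucket(confidence: float) -> str:
    """Map a 0.0-1.0 confidence score to the band names used in the review output."""
    if confidence >= 0.90:
        return "HIGH"
    if confidence >= 0.80:
        return "MEDIUM"
    if confidence >= 0.70:
        return "LOW"
    return "BELOW_THRESHOLD"  # hypothetical label; not shown in the sample output


# Made-up scores for illustration
scores = [0.95, 0.91, 0.84, 0.77, 0.72, 0.55]
distribution = Counter(confidence_bucket(s) for s in scores)

for band in ("HIGH", "MEDIUM", "LOW", "BELOW_THRESHOLD"):
    print(f"{band}: {distribution[band]} matches")
```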