mirror of https://github.com/lockin-bot/ProfileMatching.git synced 2026-01-12 09:44:30 +08:00

Andrew Jiang 5319d4d868 Initial commit: Twitter-Telegram Profile Matching System

This module provides comprehensive Twitter-to-Telegram profile matching
and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-04 22:56:25 -08:00

ProfileMatching Quick Start Guide

🚀 Fastest Way to Get Started

Option 1: Interactive Menu (RECOMMENDED)

cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py

This gives you an interactive menu with:

- Real-time statistics
- Guided workflow
- Easy access to all features

Option 2: Launch from Main UnifiedContacts Menu

cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts
python3.10 main.py
# Select option 15: "Open Profile Matching System"

📋 Typical Workflow

Step 1: Find Candidates (First Time)

If you haven't found candidates yet, run:

# From ProfileMatching folder
python3.10 find_twitter_candidates_threaded.py --workers 8

# Or use interactive menu: Option 1

Expected time: ~16-18 hours for all 43K contacts

Step 2: Verify with LLM

After candidates are found, verify them:

# From ProfileMatching folder
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100

# Or use interactive menu: Option 3

Expected time: ~23 hours for all users

Cost: ~$130 for all users (GPT-5-mini at $0.003/user)
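The cost figure follows from simple arithmetic. Below is a minimal sketch, assuming the $0.003 per-user rate and the 43K-user total quoted in this guide. The function names are illustrative and not part of the repository.

```python
def estimate_cost(users: int, cost_per_user: float = 0.003) -> float:
    """Estimated LLM spend in dollars, using the per-user rate quoted in this guide."""
    return round(users * cost_per_user, 2)


def implied_throughput(users: int, hours: float) -> float:
    """Users per minute needed to finish `users` within `hours`."""
    return users / (hours * 60)


if __name__ == "__main__":
    print(f"Cost for 43,000 users: ~${estimate_cost(43_000):.2f}")
    print(f"Throughput implied by 23 hours: {implied_throughput(43_000, 23):.1f} users/min")
```

At 43,000 users this gives $129, in line with the ~$130 quoted. A 23-hour run works out to roughly 31 users per minute.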

Step 3: Review Results

# From ProfileMatching folder
python3.10 review_match_quality.py

# Or use interactive menu: Option 5

🧪 Test Mode (Recommended Before Full Run)

Always test with a small batch first:

# Test with 50 users
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10

# Or use interactive menu: Option 4

This helps you:

- Verify the system is working correctly
- Check match quality before spending on a full run
- Estimate costs and timing
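One way to turn a test batch into an estimate for the full run is linear scaling. The sketch below assumes you note the batch's spend and wall-clock time yourself. `BatchResult` and `extrapolate` are hypothetical helpers, and the sample numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class BatchResult:
    users: int        # users processed in the test batch
    cost_usd: float   # spend observed for the batch (e.g. from the OpenAI dashboard)
    seconds: float    # wall-clock duration of the batch


def extrapolate(batch: BatchResult, total_users: int) -> tuple[float, float]:
    """Scale a test batch linearly to (estimated cost in USD, estimated hours)."""
    scale = total_users / batch.users
    return batch.cost_usd * scale, batch.seconds * scale / 3600


if __name__ == "__main__":
    # Made-up test batch: 50 users, $0.15 spent, 120 seconds elapsed.
    cost, hours = extrapolate(BatchResult(50, 0.15, 120.0), 43_000)
    print(f"Estimated full run: ~${cost:.0f}, ~{hours:.1f} hours")
```

The time estimate is only rough. A test run at `--concurrent 10` will likely understate the throughput of a full run at `--concurrent 100`.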

📊 Check Current Status

At any time, you can check progress:

# Launch interactive menu and select Option 6: "Show statistics only"
python3.10 main.py
# Press 6, then 0 to exit

Or query directly:

psql -d telegram_contacts -U andrewjiang -c "
SELECT
  COUNT(DISTINCT telegram_user_id) as users_with_candidates,
  COUNT(*) as total_candidates,
  COUNT(*) FILTER (WHERE llm_processed = TRUE) as processed,
  COUNT(*) FILTER (WHERE llm_processed = FALSE) as pending
FROM twitter_match_candidates;
"

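To see what each column of the status query reports, here is a self-contained sketch that runs an equivalent aggregate against an in-memory SQLite stand-in. Only the two columns the query touches are modeled, and the rows are made up. The real table lives in PostgreSQL and has more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE twitter_match_candidates (telegram_user_id INTEGER, llm_processed INTEGER)"
)
# Made-up rows: user 1 has two candidates (one processed), user 2 has one pending.
conn.executemany(
    "INSERT INTO twitter_match_candidates VALUES (?, ?)",
    [(1, 1), (1, 0), (2, 0)],
)

# SUM(CASE ...) stands in for FILTER so the sketch also runs on older SQLite builds.
row = conn.execute(
    """
    SELECT
      COUNT(DISTINCT telegram_user_id),
      COUNT(*),
      SUM(CASE WHEN llm_processed = 1 THEN 1 ELSE 0 END),
      SUM(CASE WHEN llm_processed = 0 THEN 1 ELSE 0 END)
    FROM twitter_match_candidates
    """
).fetchone()

users_with_candidates, total_candidates, processed, pending = row
print(users_with_candidates, total_candidates, processed, pending)
```

In the PostgreSQL query, rows where `llm_processed` is NULL are counted in `total_candidates` but in neither `processed` nor `pending`.
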
🔄 Re-Running After Updates

If you've updated the LLM prompt or matching logic:

Re-find Candidates (if matching logic changed)

# Delete old candidates
psql -d telegram_contacts -U andrewjiang -c "TRUNCATE twitter_match_candidates CASCADE;"

# Re-run candidate finding
python3.10 find_twitter_candidates_threaded.py --workers 8

Re-verify with New Prompt (if only prompt changed)

# Reset LLM processing flag
psql -d telegram_contacts -U andrewjiang -c "UPDATE twitter_match_candidates SET llm_processed = FALSE;"

# Delete old matches
psql -d telegram_contacts -U andrewjiang -c "TRUNCATE twitter_telegram_matches;"

# Re-run verification
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
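
The two reset statements above run as separate `psql` calls, so a failure in the second leaves the flags reset while the old matches remain. The sketch below runs both in one transaction. It assumes a Python DB-API connection, for example one obtained from `psycopg2.connect(...)`; whether the project uses psycopg2 is an assumption, and `reset_verification` is a hypothetical helper.

```python
def reset_verification(conn) -> None:
    """Run both reset statements in one transaction on a DB-API connection."""
    cur = conn.cursor()
    try:
        # Mark every candidate as not yet processed by the LLM.
        cur.execute("UPDATE twitter_match_candidates SET llm_processed = FALSE;")
        # Drop previously verified matches (TRUNCATE is transactional in PostgreSQL).
        cur.execute("TRUNCATE twitter_telegram_matches;")
        conn.commit()
    except Exception:
        # Undo the partial reset so the two tables stay consistent.
        conn.rollback()
        raise
    finally:
        cur.close()
```

After the reset succeeds, re-run the verification command shown above.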

🎯 Most Common Commands

Find candidates for first 1000 contacts (testing)

python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8

Verify matches for pending candidates

python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100

Check match quality distribution

python3.10 review_match_quality.py

Export matches to CSV (coming soon)

# Will be added in future update

💡 Pro Tips

- Always use threaded candidate finding - It's 10-20x faster than single-threaded candidate finding
- Use high concurrency for verification - 100-200 concurrent requests for optimal speed
- Test first - Always run with --test --limit 50 before full runs
- Monitor costs - Check the OpenAI dashboard during verification
- Check the stats - Use Option 6 in the interactive menu to monitor progress
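
For intuition about the concurrency tip, the sketch below shows a common way to cap in-flight async requests with a semaphore. It illustrates the general pattern only. It is not confirmed to be how `verify_twitter_matches_v2.py` implements `--concurrent`, and `verify_all` and `fake_verify` are hypothetical names.

```python
import asyncio


async def verify_all(user_ids, verify_one, concurrent: int = 100):
    """Run verify_one(user_id) for every user, at most `concurrent` at a time."""
    gate = asyncio.Semaphore(concurrent)

    async def guarded(user_id):
        # Wait for a free slot before starting the (simulated) LLM call.
        async with gate:
            return await verify_one(user_id)

    # gather() returns results in the same order as user_ids.
    return await asyncio.gather(*(guarded(u) for u in user_ids))


async def fake_verify(user_id):
    """Stand-in for an LLM call; sleeps briefly instead of hitting an API."""
    await asyncio.sleep(0.01)
    return user_id, "no_match"


if __name__ == "__main__":
    results = asyncio.run(verify_all(range(20), fake_verify, concurrent=5))
    print(len(results), "users verified")
```

Raising the cap helps only until the API's rate limits or local memory become the bottleneck, which is why the troubleshooting entries below mention both.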

🐛 Troubleshooting

"No candidates found"

- Check if the Twitter database has data: psql -d twitter_data -c "SELECT COUNT(*) FROM users;"
- Check Telegram contacts: psql -d telegram_contacts -c "SELECT COUNT(*) FROM contacts;"

"LLM verification is slow"

- Increase the --concurrent parameter (try 150-200)
- Check OpenAI rate limits in the dashboard
- Verify your network connection

"Too many low-quality matches"

- Review the V6 prompt in verify_twitter_matches_v2.py
- Run review_match_quality.py to analyze
- Consider adjusting confidence thresholds

"Missing obvious matches"

- Check whether a candidate was found:

SELECT * FROM twitter_match_candidates WHERE telegram_user_id = YOUR_USER_ID;

- If found but not verified, check the llm_verdict field for the reasoning
- If not found at all, a new matching method may be needed

📚 More Information

- See README.md for complete documentation
- See CHANGELOG.md for recent updates
- See individual script files for command-line options

🆘 Need Help?

Common issues and solutions:

| Issue | Solution |
| --- | --- |
| Import errors | Make sure you're using python3.10 |
| Database connection errors | Check PostgreSQL is running: `pg_isready` |
| OpenAI API errors | Verify API key in `.env` file |
| Out of memory | Reduce concurrent requests or use batching |

🎓 Understanding the Output

Candidate Finding Output

Processing contact 1000/43000 (2.3%)
Found 6 candidates for @username
  • exact_username: @username (0.90)
  • fuzzy_name: @similar_name (0.75)

LLM Verification Output

[Progress] 500/1000 users (50.0%) | 125 matches | ~$1.50 | 25.0 users/min
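
A progress line like the one above carries enough information to estimate the time remaining. This is a small sketch that assumes the log format is exactly as shown. `minutes_remaining` is a hypothetical helper, not part of the repository.

```python
import re

# Captures done/total users and the trailing users-per-minute rate.
PROGRESS_RE = re.compile(
    r"\[Progress\] (?P<done>\d+)/(?P<total>\d+) users .*\| (?P<rate>[\d.]+) users/min"
)


def minutes_remaining(line: str) -> float:
    """Estimate minutes left from a progress line in the format shown above."""
    m = PROGRESS_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognized progress line: {line!r}")
    remaining = int(m["total"]) - int(m["done"])
    return remaining / float(m["rate"])


line = "[Progress] 500/1000 users (50.0%) | 125 matches | ~$1.50 | 25.0 users/min"
print(f"{minutes_remaining(line):.0f} minutes remaining")  # prints "20 minutes remaining"
```

The sample line is also consistent with the quoted pricing: ~$1.50 over 500 users is $0.003 per user.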

Match Quality Review

Total users with matches: 25,662
Total matches: 36,147
Average confidence: 0.74

Confidence Distribution:
  90%+: 12,031 matches (HIGH)
  80-89%: 1,452 matches (MEDIUM)
  70-79%: 7,505 matches (LOW)
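
The tier labels in this output can be read as a simple threshold mapping. The sketch below assumes confidence is stored as a 0-1 score. The three rows shown account for 20,988 of the 36,147 matches, so it adds a catch-all tier whose name, `BELOW_70`, is hypothetical.

```python
from collections import Counter


def tier(confidence: float) -> str:
    """Map a 0-1 confidence score to the labels used in the review output."""
    if confidence >= 0.90:
        return "HIGH"
    if confidence >= 0.80:
        return "MEDIUM"
    if confidence >= 0.70:
        return "LOW"
    return "BELOW_70"  # hypothetical label; the sample output lists no such row


scores = [0.95, 0.91, 0.84, 0.72, 0.55]  # made-up scores
distribution = Counter(tier(s) for s in scores)
print(distribution)
```

When adjusting confidence thresholds, changing the cut-offs in a mapping like this and re-counting shows how many matches move between tiers.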
