mirror of https://github.com/lockin-bot/ProfileMatching.git synced 2026-01-12 09:44:30 +08:00


Andrew Jiang 5319d4d868 Initial commit: Twitter-Telegram Profile Matching System

This module provides comprehensive Twitter-to-Telegram profile matching
and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-04 22:56:25 -08:00

.gitignore
CHANGELOG.md
find_twitter_candidates_threaded.py
find_twitter_candidates.py
main.py
QUICKSTART.md
README.md
review_match_quality.py
setup_twitter_matching_schema.sql
verify_twitter_matches_v2.py

README.md

Twitter-Telegram Profile Matching System

A comprehensive system for finding and verifying Twitter-Telegram profile matches using multiple matching methods and LLM-based verification.

Overview

This system operates in two main steps:

Candidate Finding: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods
LLM Verification: Uses GPT-5-mini to evaluate the candidates for each contact and assign confidence scores; only matches scoring 0.70-1.0 are saved

Quick Start

cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py

Matching Methods

The system uses 10 different methods to find Twitter candidates:

High Confidence Methods (0.90-0.95; 0.88 for a phash match at distance=1)

Phash Match (0.95 for exact, 0.88 for distance=1)
- Compares profile picture hashes
- Pre-computed in telegram_twitter_phash_matches table
Exact Bio Handle (0.95)
- Extracts Twitter handles from Telegram bio
- Patterns: @username, twitter.com/username, x.com/username
Bio URL Resolution (0.95) ⭐ NEW
- Twitter bio contains shortened URL (t.co/xyz) that resolves to t.me/username
- Queries url_resolution_queue table
- Captures matches even when usernames differ
Twitter Bio Has Telegram (0.92)
- Reverse lookup: Twitter bio mentions Telegram username
- Searches for @username, t.me/username, telegram.me/username
Display Name Containment (0.92)
- Telegram name contained within Twitter display name
Exact Username (0.90)
- Telegram username exactly matches Twitter username
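
As an illustration of the Exact Bio Handle method above, here is a minimal sketch of pulling Twitter handles out of a Telegram bio. The regular expressions and the function name `extract_twitter_handles` are assumptions made for this example, not the code in find_twitter_candidates.py; they cover only the three patterns listed (@username, twitter.com/username, x.com/username).

```python
import re

# Twitter/X usernames are 1-15 characters: letters, digits, underscore.
_HANDLE = r"([A-Za-z0-9_]{1,15})"

# Hypothetical patterns for the three forms listed in this README.
# The lookbehind keeps "dropbox.com/..." from matching as "x.com/...".
_URL_RE = re.compile(
    r"(?<![A-Za-z0-9-])(?:twitter\.com|x\.com)/@?" + _HANDLE + r"\b",
    re.IGNORECASE,
)
# The lookbehind skips e-mail addresses and "@" that is part of a URL path.
_AT_RE = re.compile(r"(?<![A-Za-z0-9_.@/])@" + _HANDLE + r"\b")


def extract_twitter_handles(bio: str) -> list[str]:
    """Return lowercase handles in order of first appearance, without duplicates."""
    matches = list(_URL_RE.finditer(bio)) + list(_AT_RE.finditer(bio))
    found: list[str] = []
    for match in sorted(matches, key=lambda m: m.start()):
        handle = match.group(1).lower()
        if handle not in found:
            found.append(handle)
    return found


print(extract_twitter_handles("Builder. @Alice_01 | https://x.com/bob"))
```

A production version would likely also need to skip reserved paths such as twitter.com/home, which this sketch does not attempt.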

Medium Confidence Methods (0.80-0.88; fuzzy name matches can score as low as 0.65)

TG Username in Twitter Name (0.88)
Twitter Username in TG Name (0.86)
Fuzzy Name (0.65-0.85)
- PostgreSQL trigram similarity with 0.65 threshold
Username Variation (0.80)
- Generates variations (remove underscores, flip numbers, etc.)
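
To make the Fuzzy Name threshold concrete, the sketch below approximates PostgreSQL-style trigram similarity in pure Python: each word is lowercased, padded with two leading spaces and one trailing space, and the score is shared trigrams divided by total distinct trigrams. The function names are hypothetical, and the system itself computes this inside PostgreSQL, so treat this only as a way to reason about which name pairs clear 0.65.

```python
import re

FUZZY_THRESHOLD = 0.65  # threshold stated in this README


def trigrams(text: str) -> set[str]:
    """Build a trigram set: lowercase, split on non-alphanumerics,
    pad each word with two leading spaces and one trailing space."""
    grams: set[str] = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by the size of the union of both sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_fuzzy_name_match(telegram_name: str, twitter_name: str) -> bool:
    """True when the two display names meet the 0.65 threshold."""
    return trigram_similarity(telegram_name, twitter_name) >= FUZZY_THRESHOLD


print(round(trigram_similarity("word", "two words"), 4))  # 4 shared / 11 total = 0.3636
```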

LLM Verification

The system uses GPT-5-mini with a sophisticated V6 prompt that:

Evaluates ALL candidates together (comparative evaluation)
Applies differential scoring (only one can be "most likely")
Distinguishes between personal and company accounts
Considers signal strength holistically
Only saves matches with 70%+ confidence
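
The 70% cutoff can be pictured as a small filtering step over the scores the LLM returns for one Telegram contact. This is a sketch under assumptions: the input shape (a dict of Twitter ID to confidence) and the name `select_matches` are invented for illustration, and the README does not say whether the real script saves every candidate above the threshold or only the top one.

```python
MIN_CONFIDENCE = 0.70  # "70%+ confidence" from this README


def select_matches(
    scores: dict[str, float], min_confidence: float = MIN_CONFIDENCE
) -> list[tuple[str, float]]:
    """Keep candidates at or above the threshold, highest confidence first."""
    kept = [(twitter_id, conf) for twitter_id, conf in scores.items()
            if conf >= min_confidence]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical LLM output for a single Telegram contact.
llm_scores = {"111": 0.92, "222": 0.55, "333": 0.70}
print(select_matches(llm_scores))
```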

Files

Core Scripts

main.py - Interactive menu for running the system
find_twitter_candidates.py - Core matching logic (TwitterMatcher class)
find_twitter_candidates_threaded.py - Threaded implementation (RECOMMENDED)
verify_twitter_matches_v2.py - Async LLM verification (RECOMMENDED)
review_match_quality.py - Analyze match quality and statistics

Database Schema

setup_twitter_matching_schema.sql - Database tables and indexes

Database Tables

`twitter_match_candidates`

Stores all potential matches found by the matching methods.

Key fields:

telegram_user_id - Telegram contact user ID
twitter_id - Twitter profile ID
match_method - Which method found this candidate
baseline_confidence - Initial confidence (0.0-1.0)
match_signals - JSON with match details
llm_processed - Whether LLM has evaluated this candidate

`twitter_telegram_matches`

Stores verified matches (70%+ confidence from LLM).

Key fields:

telegram_user_id - Telegram contact
twitter_id - Matched Twitter profile
final_confidence - LLM-assigned confidence (0.70-1.0)
llm_verdict - LLM reasoning
match_method - Original matching method
matched_at - Timestamp

`url_resolution_queue`

Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).

Key fields:

twitter_id - Twitter profile ID
original_url - Shortened URL (e.g., t.co/abc)
resolved_url - Full URL (e.g., https://t.me/username)
telegram_handles - Extracted Telegram handles (JSONB array)

Usage Examples

Find Candidates for All Contacts (Threaded)

python3.10 find_twitter_candidates_threaded.py --workers 8

Find Candidates for First 1000 Contacts

python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8

Verify Matches with LLM (100 concurrent requests)

python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
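
To show what a `--concurrent` limit typically means in async code, here is a minimal sketch that caps in-flight requests with `asyncio.Semaphore`. It assumes nothing about the internals of verify_twitter_matches_v2.py: `verify_user` is a placeholder that sleeps instead of calling the model, and all names are hypothetical.

```python
import asyncio


async def verify_user(user_id: int) -> dict:
    """Placeholder for one LLM call; a real version would call the model API."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"telegram_user_id": user_id, "verified": True}


async def verify_all(user_ids, concurrency: int = 100) -> list[dict]:
    """Verify every user while keeping at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(user_id: int) -> dict:
        async with semaphore:
            return await verify_user(user_id)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(uid) for uid in user_ids))


results = asyncio.run(verify_all(range(20), concurrency=5))
print(len(results))
```

Raising the limit only helps until the API rate limit becomes the bottleneck, which is why the test mode starts with a lower value.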

Test Mode (50 users, 10 concurrent)

python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10

Review Match Quality

python3.10 review_match_quality.py

Performance

Candidate Finding (Threaded)

Speed: ~1.5 contacts/sec
Time for 43K contacts: ~16-18 hours
Workers: 8 (default, configurable)
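
The worker setting above corresponds to a standard thread-pool pattern, sketched here with `concurrent.futures.ThreadPoolExecutor`. This is an assumption about the general approach, not the contents of find_twitter_candidates_threaded.py; `find_candidates` is a stand-in for the per-contact database lookups, and the `exact_username` label is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def find_candidates(contact_id: int) -> list[dict]:
    """Stand-in for the per-contact matching work (database queries in practice)."""
    return [{"telegram_user_id": contact_id, "match_method": "exact_username"}]


def find_all_candidates(contact_ids, workers: int = 8) -> dict[int, list[dict]]:
    """Run find_candidates over all contacts using a pool of worker threads."""
    ids = list(contact_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so zip pairs them correctly.
        return dict(zip(ids, pool.map(find_candidates, ids)))


candidates = find_all_candidates([1, 2, 3], workers=2)
print(sorted(candidates))
```

Threads suit this workload when most of the time is spent waiting on database queries rather than on Python computation.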

LLM Verification (Async)

Speed: ~32 users/minute with 100 concurrent requests
Cost: ~$0.003 per user (GPT-5-mini)
Time for 43K users: ~23 hours
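
The verification figures above follow from simple arithmetic, reproduced here so the estimate can be rerun for a different contact count. The function name `estimate_verification` is hypothetical; the inputs are the numbers quoted in this section.

```python
def estimate_verification(users: int, users_per_minute: float,
                          cost_per_user: float) -> tuple[float, float]:
    """Return (hours, total cost in dollars) for an LLM verification run."""
    hours = users / users_per_minute / 60
    total_cost = users * cost_per_user
    return hours, total_cost


hours, cost = estimate_verification(43_000, 32, 0.003)
print(f"{hours:.1f} hours, ${cost:.0f}")  # prints: 22.4 hours, $129
```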

Recent Improvements

V6 Prompt (Latest)

Upfront directive for comparative evaluation
Clear signal strength hierarchy
Company vs personal account differentiation
Streamlined from ~135 to ~90 lines while improving clarity

URL Resolution Integration

Added Method 5b: Bio URL resolution
Captures 140+ additional matches
Especially valuable when usernames differ
0.95 baseline confidence (very high)

Configuration

Environment variables (in /Users/andrewjiang/Bao/TimeToLockIn/Profile/.env):

OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini

Database connections:

telegram_contacts - Telegram contact data
twitter_data - Twitter profile data

Tips

Always run threaded candidate finding - 10-20x faster than single-threaded
Use high concurrency for LLM verification - 100+ concurrent requests for optimal speed
Monitor costs - Check OpenAI usage during verification
Review match quality periodically - Use review_match_quality.py to analyze results
Test first - Use --test --limit 50 flags before full runs

Troubleshooting

LLM verification is slow

Increase --concurrent parameter (try 100-200)
Check OpenAI rate limits (1,000 RPM for Tier 1)

Many low-quality matches

Review and adjust V6 prompt in verify_twitter_matches_v2.py
Check review_match_quality.py for insights

Missing obvious matches

Check if candidate was found: Query twitter_match_candidates
If not found, may need new matching method
If found but not verified, check LLM reasoning in llm_verdict

Future Enhancements

Add more matching methods (location, bio keywords, etc.)
Implement feedback loop for prompt improvement
Add manual review interface for borderline matches
Export matches to various formats