mirror of https://github.com/lockin-bot/ProfileMatching.git synced 2026-01-12 09:44:30 +08:00

Andrew Jiang 5319d4d868 Initial commit: Twitter-Telegram Profile Matching System

This module provides comprehensive Twitter-to-Telegram profile matching
and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-04 22:56:25 -08:00

Twitter-Telegram Profile Matching System

A comprehensive system for finding and verifying Twitter-Telegram profile matches using multiple matching methods and LLM-based verification.

Overview

This system operates in two main steps:

Candidate Finding: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods.
LLM Verification: Uses GPT-5-mini to evaluate each contact's candidates together and assign confidence scores. Only matches scoring 0.70-1.0 are saved.

Quick Start

cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py

Matching Methods

The system uses 10 different methods to find Twitter candidates:

High Confidence Methods (0.90-0.95)

Phash Match (0.95 for exact, 0.88 for distance=1)
- Compares profile picture hashes
- Pre-computed in telegram_twitter_phash_matches table
Exact Bio Handle (0.95)
- Extracts Twitter handles from Telegram bio
- Patterns: @username, twitter.com/username, x.com/username
Bio URL Resolution (0.95) ⭐ NEW
- Twitter bio contains shortened URL (t.co/xyz) that resolves to t.me/username
- Queries url_resolution_queue table
- Captures matches even when usernames differ
Twitter Bio Has Telegram (0.92)
- Reverse lookup: Twitter bio mentions Telegram username
- Searches for @username, t.me/username, telegram.me/username
Display Name Containment (0.92)
- Telegram name contained within Twitter display name
Exact Username (0.90)
- Telegram username exactly matches Twitter username
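
The Exact Bio Handle method above names three patterns (@username, twitter.com/username, x.com/username). A minimal sketch of that extraction follows. The function name, the regular expressions, and the lowercase normalisation are assumptions for illustration, not the code in find_twitter_candidates.py. It relies only on the fact that Twitter handles are at most 15 characters of letters, digits, and underscores.

```python
import re

# Hypothetical patterns mirroring the three forms listed above.
# The lookbehind keeps "x.com" from matching inside hosts such as "dropbox.com".
_URL_HANDLE = re.compile(
    r"(?<![A-Za-z0-9-])(?:twitter\.com|x\.com)/@?([A-Za-z0-9_]{1,15})\b",
    re.IGNORECASE,
)
# The lookbehind skips e-mail addresses such as "name@example.com".
_AT_HANDLE = re.compile(r"(?<![A-Za-z0-9_.])@([A-Za-z0-9_]{1,15})\b")


def extract_twitter_handles(bio: str) -> list[str]:
    """Return lowercase handles: URL-style first, then @-mentions, de-duplicated."""
    found: list[str] = []
    for pattern in (_URL_HANDLE, _AT_HANDLE):
        for match in pattern.finditer(bio or ""):
            handle = match.group(1).lower()
            if handle not in found:
                found.append(handle)
    return found


if __name__ == "__main__":
    print(extract_twitter_handles("Builder. twitter.com/alice_dev | @Bob99"))
```

A bare @username in a Telegram bio may also refer to a Telegram account, so a real implementation would probably confirm each handle against the Twitter data before recording a candidate.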

Medium Confidence Methods (0.65-0.88)

TG Username in Twitter Name (0.88)
Twitter Username in TG Name (0.86)
Fuzzy Name (0.65-0.85)
- Uses PostgreSQL trigram similarity (pg_trgm) with a 0.65 threshold
Username Variation (0.80)
- Generates variations (remove underscores, flip numbers, etc.)
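
To make the Fuzzy Name threshold concrete, here is a rough Python approximation of how pg_trgm scores two strings: lowercase the text, pad each word with two leading spaces and one trailing space, collect three-character windows, and divide the shared trigrams by the combined set. This is a sketch only. The function names are hypothetical, non-ASCII letters are ignored here, and the project itself runs the comparison inside PostgreSQL.

```python
import re

FUZZY_THRESHOLD = 0.65  # threshold quoted for the Fuzzy Name method


def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm trigram extraction for ASCII letters and digits."""
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 if either is empty)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_fuzzy_match(telegram_name: str, twitter_name: str) -> bool:
    """True when the names meet the 0.65 similarity threshold."""
    return similarity(telegram_name, twitter_name) >= FUZZY_THRESHOLD


if __name__ == "__main__":
    print(round(similarity("word", "two words"), 4))
```

For "word" against "two words" the two sets share 4 of 11 distinct trigrams, about 0.36, which is well under the threshold.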

LLM Verification

The system uses GPT-5-mini with a sophisticated V6 prompt that:

Evaluates ALL candidates together (comparative evaluation)
Applies differential scoring (only one can be "most likely")
Distinguishes between personal and company accounts
Considers signal strength holistically
Only saves matches with 70%+ confidence
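
The last rule above (save only matches at 70%+ confidence) can be sketched as a small post-processing step applied to the scores returned by the model. The ScoredCandidate type and the select_matches function are hypothetical names used for illustration. How the real script parses the LLM response is not shown here.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.70  # matches below this are not saved


@dataclass
class ScoredCandidate:
    twitter_id: str
    confidence: float  # LLM-assigned score in the range 0.0-1.0


def select_matches(scored: list[ScoredCandidate]) -> list[ScoredCandidate]:
    """Keep candidates at or above the threshold, highest confidence first."""
    kept = [c for c in scored if c.confidence >= MIN_CONFIDENCE]
    return sorted(kept, key=lambda c: c.confidence, reverse=True)


if __name__ == "__main__":
    scored = [
        ScoredCandidate("111", 0.55),
        ScoredCandidate("222", 0.93),
        ScoredCandidate("333", 0.70),
    ]
    print([c.twitter_id for c in select_matches(scored)])
```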

Files

Core Scripts

main.py - Interactive menu for running the system
find_twitter_candidates.py - Core matching logic (TwitterMatcher class)
find_twitter_candidates_threaded.py - Threaded implementation (RECOMMENDED)
verify_twitter_matches_v2.py - LLM verification with async (RECOMMENDED)
review_match_quality.py - Analyze match quality and statistics

Database Schema

setup_twitter_matching_schema.sql - Database tables and indexes

Database Tables

`twitter_match_candidates`

Stores all potential matches found by the matching methods.

Key fields:

telegram_user_id - Telegram contact user ID
twitter_id - Twitter profile ID
match_method - Which method found this candidate
baseline_confidence - Initial confidence (0.0-1.0)
match_signals - JSON with match details
llm_processed - Whether LLM has evaluated this candidate

`twitter_telegram_matches`

Stores verified matches (70%+ confidence from LLM).

Key fields:

telegram_user_id - Telegram contact
twitter_id - Matched Twitter profile
final_confidence - LLM-assigned confidence (0.70-1.0)
llm_verdict - LLM reasoning
match_method - Original matching method
matched_at - Timestamp

`url_resolution_queue`

Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).

Key fields:

twitter_id - Twitter profile ID
original_url - Shortened URL (e.g., t.co/abc)
resolved_url - Full URL (e.g., https://t.me/username)
telegram_handles - Extracted Telegram handles (JSONB array)
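
As an illustration of how a resolved_url could yield an entry for telegram_handles, the sketch below pulls the first path segment from t.me and telegram.me links. The function name, the 5-32 character handle rule, and the short list of excluded path prefixes are assumptions, and the exclusion list is not exhaustive.

```python
import re
from typing import Optional
from urllib.parse import urlparse

_TG_HOSTS = {"t.me", "telegram.me"}
# Assumed handle rule: 5-32 characters of letters, digits, and underscores.
_TG_HANDLE = re.compile(r"^[A-Za-z0-9_]{5,32}$")
# Hypothetical, non-exhaustive list of path prefixes that are not profiles.
_NON_PROFILE = {"joinchat", "addstickers", "share", "proxy", "socks"}
_HAS_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def telegram_handle_from_url(resolved_url: str) -> Optional[str]:
    """Return the lowercase handle from a Telegram profile URL, or None."""
    url = resolved_url.strip()
    if not _HAS_SCHEME.match(url):
        url = f"https://{url}"  # allow bare forms such as "t.me/username"
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in _TG_HOSTS:
        return None
    first_segment = parsed.path.strip("/").split("/")[0]
    if first_segment.lower() in _NON_PROFILE:
        return None
    if not _TG_HANDLE.match(first_segment):
        return None  # also rejects invite links such as t.me/+abcdef
    return first_segment.lower()


if __name__ == "__main__":
    print(telegram_handle_from_url("https://t.me/username"))
```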

Usage Examples

Find Candidates for All Contacts (Threaded)

python3.10 find_twitter_candidates_threaded.py --workers 8

Find Candidates for First 1000 Contacts

python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8

Verify Matches with LLM (100 concurrent requests)

python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100

Test Mode (50 users, 10 concurrent)

python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10

Review Match Quality

python3.10 review_match_quality.py

Performance

Candidate Finding (Threaded)

Speed: ~1.5 contacts/sec
Time for 43K contacts: ~16-18 hours
Workers: 8 (default, configurable)

LLM Verification (Async)

Speed: ~32 users/minute with 100 concurrent requests
Cost: ~$0.003 per user (GPT-5-mini)
Time for 43K users: ~23 hours
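
The figures in this section can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above, and the variable names are illustrative. The LLM figures agree with each other (about 22.4 hours and about $129). For candidate finding, a sustained 1.5 contacts/sec would finish 43K contacts in about 8 hours, so the quoted 16-18 hours corresponds to roughly 0.66-0.75 contacts/sec. One possible reading is that ~1.5 contacts/sec is a peak rate rather than a sustained average.

```python
CONTACTS = 43_000


def hours_for(total: int, per_second: float) -> float:
    """Wall-clock hours needed to process `total` items at a steady rate."""
    return total / per_second / 3600


candidate_hours = hours_for(CONTACTS, 1.5)   # at ~1.5 contacts/sec
llm_hours = hours_for(CONTACTS, 32 / 60)     # at ~32 users/minute
llm_cost = CONTACTS * 0.003                  # at ~$0.003 per user

# Sustained rates implied by the quoted 16-18 hour range.
implied_rate_16h = CONTACTS / (16 * 3600)
implied_rate_18h = CONTACTS / (18 * 3600)

if __name__ == "__main__":
    print(f"candidate finding at 1.5/sec: {candidate_hours:.1f} h")   # 8.0 h
    print(f"LLM verification at 32/min:   {llm_hours:.1f} h")         # 22.4 h
    print(f"LLM cost:                     ${llm_cost:.0f}")           # $129
    print(f"rate implied by 16 h:         {implied_rate_16h:.2f}/sec")  # 0.75/sec
    print(f"rate implied by 18 h:         {implied_rate_18h:.2f}/sec")  # 0.66/sec
```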

Recent Improvements

V6 Prompt (Latest)

Upfront directive for comparative evaluation
Clear signal strength hierarchy
Company vs personal account differentiation
Streamlined from ~135 to ~90 lines while improving clarity

URL Resolution Integration

Added Method 5b: Bio URL resolution
Captures 140+ additional matches
Especially valuable when usernames differ
0.95 baseline confidence (very high)

Configuration

Environment variables (in /Users/andrewjiang/Bao/TimeToLockIn/Profile/.env):

OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini
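
A minimal sketch of reading these two variables in Python follows. The function name and the fallback to gpt-5-mini when OPENAI_MODEL is unset are assumptions. The sketch also assumes the .env file has already been loaded into the process environment, for example by the shell or a helper such as python-dotenv.

```python
import os
from typing import Mapping


def load_openai_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the API key and model name, failing early if the key is missing."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumed default: fall back to the model named in this README.
    return {"api_key": api_key, "model": env.get("OPENAI_MODEL", "gpt-5-mini")}


if __name__ == "__main__":
    print(load_openai_settings({"OPENAI_API_KEY": "your_key_here"})["model"])
```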

Database connections:

telegram_contacts - Telegram contact data
twitter_data - Twitter profile data

Tips

Always run threaded candidate finding - 10-20x faster than single-threaded
Use high concurrency for LLM verification - 100+ concurrent requests for optimal speed
Monitor costs - Check OpenAI usage during verification
Review match quality periodically - Use review_match_quality.py to analyze results
Test first - Use --test --limit 50 flags before full runs

Troubleshooting

LLM verification is slow

Increase --concurrent parameter (try 100-200)
Check OpenAI rate limits (1,000 RPM for Tier 1; limits vary by model and account tier, so confirm the value for your account)
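
A common way to implement a cap like --concurrent is an asyncio.Semaphore that bounds the number of in-flight requests. The sketch below shows that general pattern with a sleep standing in for the API call. verify_user and verify_all are hypothetical names, and this is not the code in verify_twitter_matches_v2.py.

```python
import asyncio
from typing import Iterable


async def verify_user(user_id: int) -> dict:
    """Hypothetical stand-in for one LLM call; sleeps instead of calling an API."""
    await asyncio.sleep(0.01)
    return {"telegram_user_id": user_id, "verified": True}


async def verify_all(user_ids: Iterable[int], concurrent: int = 100) -> list[dict]:
    """Verify every user while keeping at most `concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(concurrent)

    async def bounded(user_id: int) -> dict:
        async with semaphore:  # waits here once the cap is reached
            return await verify_user(user_id)

    # gather returns results in the same order as the input ids
    return await asyncio.gather(*(bounded(u) for u in user_ids))


if __name__ == "__main__":
    results = asyncio.run(verify_all(range(50), concurrent=10))
    print(len(results))
```

Raising the cap only helps until the provider's rate limit or the database becomes the bottleneck, so it is worth increasing it in steps and watching for rate-limit errors.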

Many low-quality matches

Review and adjust V6 prompt in verify_twitter_matches_v2.py
Check review_match_quality.py for insights

Missing obvious matches

Check whether a candidate was found by querying twitter_match_candidates
If no candidate was found, a new matching method may be needed
If a candidate was found but not verified, check the LLM reasoning in llm_verdict
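
The checks above amount to a small decision procedure. The sketch below expresses it over rows that have already been fetched as plain dictionaries using the field names from the Database Tables section. The function name, the returned messages, and the extra "pending" case based on llm_processed are assumptions, and fetching the rows from PostgreSQL is left out.

```python
def diagnose_missing_match(telegram_user_id: int,
                           candidates: list[dict],
                           matches: list[dict]) -> str:
    """Suggest where to look when an expected match is absent.

    `candidates` are rows from twitter_match_candidates and `matches` are
    rows from twitter_telegram_matches, already loaded as dictionaries.
    """
    user_candidates = [c for c in candidates
                       if c["telegram_user_id"] == telegram_user_id]
    if not user_candidates:
        return "no candidate found: a new matching method may be needed"
    if any(m["telegram_user_id"] == telegram_user_id for m in matches):
        return "a verified match exists"
    if not all(c["llm_processed"] for c in user_candidates):
        return "candidates are still waiting for LLM verification"
    return "candidates found but not verified: check the LLM reasoning"


if __name__ == "__main__":
    rows = [{"telegram_user_id": 1, "twitter_id": "42", "llm_processed": True}]
    print(diagnose_missing_match(1, rows, []))
```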

Future Enhancements

Add more matching methods (location, bio keywords, etc.)
Implement feedback loop for prompt improvement
Add manual review interface for borderline matches
Export matches to various formats
