ProfileMatching/README.md
Andrew Jiang 5319d4d868 Initial commit: Twitter-Telegram Profile Matching System
2025-11-04 22:56:25 -08:00


Twitter-Telegram Profile Matching System

A system for finding and verifying Twitter-Telegram profile matches. It combines 10 candidate-matching methods with LLM-based verification.

Overview

This system operates in two main steps:

  1. Candidate Finding: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods
  2. LLM Verification: Uses GPT-5-mini to evaluate the candidates and assign confidence scores; only matches scoring 0.70 or higher are saved

Quick Start

cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py

Matching Methods

The system uses 10 different methods to find Twitter candidates:

High Confidence Methods (0.90-0.95)

  1. Phash Match (0.95 for exact, 0.88 for distance=1)

    • Compares profile picture hashes
    • Pre-computed in telegram_twitter_phash_matches table
  2. Exact Bio Handle (0.95)

    • Extracts Twitter handles from Telegram bio
    • Patterns: @username, twitter.com/username, x.com/username
  3. Bio URL Resolution (0.95) NEW

    • Twitter bio contains shortened URL (t.co/xyz) that resolves to t.me/username
    • Queries url_resolution_queue table
    • Captures matches even when usernames differ
  4. Twitter Bio Has Telegram (0.92)

    • Reverse lookup: Twitter bio mentions Telegram username
    • Searches for @username, t.me/username, telegram.me/username
  5. Display Name Containment (0.92)

    • Telegram name contained within Twitter display name
  6. Exact Username (0.90)

    • Telegram username exactly matches Twitter username
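
As an illustration of methods 1 and 2, here is a minimal sketch in Python. It is not the code in find_twitter_candidates.py: the regular expression, the function names, and the assumption that profile-picture hashes are stored as integers compared by Hamming distance are all hypothetical. Only the three bio patterns and the 0.95 / 0.88 confidence values come from the list above.

```python
import re
from typing import List, Optional

# Hypothetical pattern for the three forms listed above:
# @username, twitter.com/username, x.com/username.
# Twitter handles are 1-15 characters: letters, digits, underscore.
_HANDLE_RE = re.compile(
    r"(?:(?<![\w.])(?:https?://)?(?:www\.)?(?:twitter|x)\.com/|(?<![\w@])@)"
    r"([A-Za-z0-9_]{1,15})\b",
    re.IGNORECASE,
)


def extract_twitter_handles(bio: str) -> List[str]:
    """Return unique lowercase handles found in a bio, in order of appearance."""
    handles: List[str] = []
    for match in _HANDLE_RE.finditer(bio):
        handle = match.group(1).lower()
        if handle not in handles:
            handles.append(handle)
    return handles


def phash_confidence(hash_a: int, hash_b: int) -> Optional[float]:
    """Map the Hamming distance between two hashes to a baseline confidence.

    Assumes hashes are plain non-negative integers; returns None when the
    distance is greater than 1 (no phash candidate).
    """
    distance = bin(hash_a ^ hash_b).count("1")
    if distance == 0:
        return 0.95
    if distance == 1:
        return 0.88
    return None
```

The lookbehinds keep the pattern from firing on e-mail addresses (me@example.com) or on domains that merely end in "x.com" (dropbox.com).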

Medium Confidence Methods (0.65-0.88)

  1. TG Username in Twitter Name (0.88)
  2. Twitter Username in TG Name (0.86)
  3. Fuzzy Name (0.65-0.85)
    • PostgreSQL trigram similarity with 0.65 threshold
  4. Username Variation (0.80)
    • Generates variations (remove underscores, flip numbers, etc.)
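
To make the Fuzzy Name threshold concrete, the sketch below approximates PostgreSQL trigram similarity in pure Python. The real system presumably calls the database's own similarity function; this re-implementation, its function names, and its word-splitting rule are assumptions for illustration and may differ from the database on edge cases.

```python
import re
from typing import Set

FUZZY_THRESHOLD = 0.65  # minimum similarity, taken from the list above


def trigrams(text: str) -> Set[str]:
    """Approximate pg_trgm: lowercase, split into alphanumeric words,
    pad each word with two leading spaces and one trailing space."""
    grams: Set[str] = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_fuzzy_candidate(telegram_name: str, twitter_name: str) -> bool:
    """True when the two display names clear the 0.65 threshold."""
    return similarity(telegram_name, twitter_name) >= FUZZY_THRESHOLD
```

For example, similarity("word", "two words") shares 4 of 11 distinct trigrams, which is well below the threshold, while names that differ only in case score 1.0.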

LLM Verification

The system uses GPT-5-mini with a V6 prompt that:

  • Evaluates ALL candidates together (comparative evaluation)
  • Applies differential scoring (only one can be "most likely")
  • Distinguishes between personal and company accounts
  • Considers signal strength holistically
  • Only saves matches with 70%+ confidence
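
The 70% cutoff can be pictured as a filter over the scores the LLM returns for one Telegram contact. The sketch below is a guess at the shape of that step, not the logic in verify_twitter_matches_v2.py: the function name and the dict-of-scores input are hypothetical, and the README does not say whether more than one candidate per contact can be saved.

```python
from typing import Dict, List, Tuple

MIN_CONFIDENCE = 0.70  # cutoff stated above


def select_matches(
    scores: Dict[str, float],
    min_confidence: float = MIN_CONFIDENCE,
) -> List[Tuple[str, float]]:
    """Keep candidates at or above the cutoff, strongest first.

    `scores` maps a Twitter ID to the confidence the LLM assigned to it
    (hypothetical input shape).
    """
    kept = [(twitter_id, score) for twitter_id, score in scores.items()
            if score >= min_confidence]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Because all candidates for a contact are scored together, the first element of the returned list is the "most likely" profile.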

Files

Core Scripts

  • main.py - Interactive menu for running the system
  • find_twitter_candidates.py - Core matching logic (TwitterMatcher class)
  • find_twitter_candidates_threaded.py - Threaded implementation (RECOMMENDED)
  • verify_twitter_matches_v2.py - LLM verification with async (RECOMMENDED)
  • review_match_quality.py - Analyze match quality and statistics

Database Schema

  • setup_twitter_matching_schema.sql - Database tables and indexes

Database Tables

twitter_match_candidates

Stores all potential matches found by the matching methods.

Key fields:

  • telegram_user_id - Telegram contact user ID
  • twitter_id - Twitter profile ID
  • match_method - Which method found this candidate
  • baseline_confidence - Initial confidence (0.0-1.0)
  • match_signals - JSON with match details
  • llm_processed - Whether LLM has evaluated this candidate

twitter_telegram_matches

Stores verified matches (70%+ confidence from LLM).

Key fields:

  • telegram_user_id - Telegram contact
  • twitter_id - Matched Twitter profile
  • final_confidence - LLM-assigned confidence (0.70-1.0)
  • llm_verdict - LLM reasoning
  • match_method - Original matching method
  • matched_at - Timestamp

url_resolution_queue

Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).

Key fields:

  • twitter_id - Twitter profile ID
  • original_url - Shortened URL (e.g., t.co/abc)
  • resolved_url - Full URL (e.g., https://t.me/username)
  • telegram_handles - Extracted Telegram handles (JSONB array)
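
A resolved URL can be turned into a Telegram handle roughly as follows. This is a sketch under assumptions: the function name, the list of non-user paths, and the username rule (a letter followed by letters, digits, or underscores, 4 to 32 characters in total) are illustrative and not taken from the project code.

```python
import re
from typing import Optional
from urllib.parse import urlparse

TELEGRAM_HOSTS = {"t.me", "telegram.me"}
# Paths on t.me that are not user profiles (non-exhaustive, assumed).
NON_USER_PATHS = {"joinchat", "addstickers", "share"}
# Assumed username shape: starts with a letter, 4-32 characters overall.
USERNAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{3,31}")


def telegram_handle_from_url(resolved_url: str) -> Optional[str]:
    """Return the lowercase Telegram handle in a resolved URL, or None."""
    if "://" not in resolved_url:
        resolved_url = "https://" + resolved_url  # tolerate bare "t.me/x"
    parsed = urlparse(resolved_url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in TELEGRAM_HOSTS:
        return None
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        return None
    first = segments[0]
    if first.lower() in NON_USER_PATHS or not USERNAME_RE.fullmatch(first):
        return None
    return first.lower()
```

Invite links such as t.me/+AbCdEf and t.me/joinchat/... return None, so only profile-style links would end up in telegram_handles.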

Usage Examples

Find Candidates for All Contacts (Threaded)

python3.10 find_twitter_candidates_threaded.py --workers 8

Find Candidates for First 1000 Contacts

python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8

Verify Matches with LLM (100 concurrent requests)

python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100

Test Mode (50 users, 10 concurrent)

python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10

Review Match Quality

python3.10 review_match_quality.py

Performance

Candidate Finding (Threaded)

  • Speed: ~1.5 contacts/sec
  • Time for 43K contacts: ~16-18 hours (an average of roughly 0.7 contacts/sec, below the ~1.5 contacts/sec figure above)
  • Workers: 8 (default, configurable)
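
The worker model can be sketched with Python's standard thread pool. This is not the code in find_twitter_candidates_threaded.py; find_candidates_stub and run_threaded are hypothetical names, and the stub stands in for the real per-contact matching work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def find_candidates_stub(contact_id: int) -> List[str]:
    """Placeholder for the per-contact matching work (hypothetical)."""
    return [f"candidate-for-{contact_id}"]


def run_threaded(
    contact_ids: Iterable[int],
    worker: Callable[[int], List[str]],
    max_workers: int = 8,
) -> Dict[int, List[str]]:
    """Run `worker` over all contacts using a fixed-size thread pool."""
    ids = list(contact_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return dict(zip(ids, pool.map(worker, ids)))
```

Threads suit this workload if most of the time per contact is spent waiting on database queries, which is an assumption about the matching code rather than something measured here.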

LLM Verification (Async)

  • Speed: ~32 users/minute with 100 concurrent requests
  • Cost: ~$0.003 per user (GPT-5-mini), about $130 for 43K users
  • Time for 43K users: ~23 hours
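
The verification figures above follow from simple arithmetic, which is handy for estimating a run of a different size. The helper names below are hypothetical; the rates are the ones quoted in this section.

```python
def verification_hours(users: int, users_per_minute: float) -> float:
    """Wall-clock hours needed to verify `users` at a steady rate."""
    return users / users_per_minute / 60


def verification_cost(users: int, cost_per_user: float) -> float:
    """Total API cost in dollars at a flat per-user price."""
    return users * cost_per_user


if __name__ == "__main__":
    print(f"{verification_hours(43_000, 32):.1f} hours")
    print(f"${verification_cost(43_000, 0.003):.0f}")
```

At 32 users/minute, 43,000 users take a little over 22 hours, and at $0.003 per user the run costs about $129.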

Recent Improvements

V6 Prompt (Latest)

  • Upfront directive for comparative evaluation
  • Clear signal strength hierarchy
  • Company vs personal account differentiation
  • Shortened from ~135 to ~90 lines without losing clarity

URL Resolution Integration

  • Added Method 5b: Bio URL Resolution (listed as method 3 under Matching Methods)
  • Captures 140+ additional matches
  • Especially valuable when usernames differ
  • 0.95 baseline confidence (very high)

Configuration

Environment variables (in /Users/andrewjiang/Bao/TimeToLockIn/Profile/.env):

OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini
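
A script might read these two variables as sketched below. The function name is hypothetical, and falling back to gpt-5-mini when OPENAI_MODEL is unset is an assumption; the sketch also assumes the .env file has already been loaded into the process environment.

```python
import os
from typing import Dict, Mapping, Optional


def load_openai_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the API key and model name, failing early if the key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumed default; the README only shows gpt-5-mini as the configured value.
    return {"api_key": api_key, "model": env.get("OPENAI_MODEL", "gpt-5-mini")}
```

Failing at startup when the key is missing avoids discovering the problem partway through a long verification run.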

Database connections:

  • telegram_contacts - Telegram contact data
  • twitter_data - Twitter profile data

Tips

  1. Always run threaded candidate finding - 10-20x faster than single-threaded
  2. Use high concurrency for LLM verification - 100+ concurrent requests for optimal speed
  3. Monitor costs - Check OpenAI usage during verification
  4. Review match quality periodically - Use review_match_quality.py to analyze results
  5. Test first - Use --test --limit 50 flags before full runs

Troubleshooting

LLM verification is slow

  • Increase --concurrent parameter (try 100-200)
  • Check your OpenAI rate limits (e.g., 1,000 RPM for Tier 1; limits vary by model and usage tier)
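
For context on what a concurrency limit does, here is a minimal sketch of bounded concurrency with asyncio. It is an assumption about how a flag like --concurrent could be implemented, not the code in verify_twitter_matches_v2.py; verify_stub stands in for the real API call and verify_all is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def verify_stub(user_id: int) -> float:
    """Placeholder for one LLM verification request (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return 0.9


async def verify_all(
    user_ids: Iterable[int],
    verify: Callable[[int], Awaitable[float]],
    concurrent: int = 100,
) -> List[float]:
    """Verify every user, with at most `concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(concurrent)

    async def bounded(user_id: int) -> float:
        async with semaphore:
            return await verify(user_id)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(uid) for uid in user_ids))


if __name__ == "__main__":
    scores = asyncio.run(verify_all(range(10), verify_stub, concurrent=3))
    print(len(scores), "users verified")
```

Raising the limit only helps until the API's rate limit becomes the bottleneck; past that point, extra in-flight requests are more likely to produce rate-limit errors than extra throughput.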

Many low-quality matches

  • Review and adjust V6 prompt in verify_twitter_matches_v2.py
  • Check review_match_quality.py for insights

Missing obvious matches

  • Check if candidate was found: Query twitter_match_candidates
  • If not found, a new matching method may be needed
  • If found but not verified, check LLM reasoning in llm_verdict
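
The first check can be scripted as a lookup by Telegram user ID. The sketch below uses an in-memory SQLite database as a stand-in so that it runs anywhere; the real tables live in PostgreSQL, where the placeholder syntax and column types will differ. The column types, the sample rows, and the match_method values are all made up for illustration.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway table with the key fields described above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE twitter_match_candidates ("
        "telegram_user_id INTEGER, twitter_id TEXT, match_method TEXT, "
        "baseline_confidence REAL, llm_processed INTEGER)"
    )
    conn.executemany(
        "INSERT INTO twitter_match_candidates VALUES (?, ?, ?, ?, ?)",
        [
            (42, "111", "exact_username", 0.90, 0),    # sample data only
            (42, "222", "exact_bio_handle", 0.95, 1),
            (7, "333", "fuzzy_name", 0.70, 0),
        ],
    )
    return conn


def candidates_for(conn: sqlite3.Connection, telegram_user_id: int) -> List[Tuple]:
    """List every candidate found for one contact, strongest first."""
    return conn.execute(
        "SELECT twitter_id, match_method, baseline_confidence, llm_processed "
        "FROM twitter_match_candidates "
        "WHERE telegram_user_id = ? "
        "ORDER BY baseline_confidence DESC",
        (telegram_user_id,),
    ).fetchall()
```

An empty result means no matching method produced a candidate for that contact. Rows with llm_processed set but no entry in twitter_telegram_matches point to a candidate the LLM scored below 70%.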

Future Enhancements

  • Add more matching methods (location, bio keywords, etc.)
  • Implement feedback loop for prompt improvement
  • Add manual review interface for borderline matches
  • Export matches to various formats