ProfileMatching/CHANGELOG.md
Andrew Jiang 5319d4d868 Initial commit: Twitter-Telegram Profile Matching System
This module provides comprehensive Twitter-to-Telegram profile matching
and verification using 10 different matching methods and LLM verification.

Features:
- 10 matching methods (phash, usernames, bio handles, URL resolution, fuzzy names)
- URL resolution integration for t.co → t.me links
- Async LLM verification with GPT-5-mini
- Interactive menu system with real-time stats
- Threaded candidate finding (~1.5 contacts/sec)
- Comprehensive documentation and guides

Key Components:
- find_twitter_candidates.py: Core matching logic (10 methods)
- find_twitter_candidates_threaded.py: Threaded implementation
- verify_twitter_matches_v2.py: LLM verification (V5 prompt)
- review_match_quality.py: Analysis and quality review
- main.py: Interactive menu system
- Complete documentation (README, CHANGELOG, QUICKSTART)

Performance:
- Candidate finding: ~16-18 hours for 43K contacts
- LLM verification: ~23 hours for 43K users
- Cost: ~$130 for full verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-04 22:56:25 -08:00

ProfileMatching Changelog

2025-11-04 - Initial Module Creation & URL Resolution Integration

Module Organization

  • Created dedicated ProfileMatching/ folder within UnifiedContacts
  • Bundled all matching and verification scripts together
  • Added comprehensive interactive main.py with menu system
  • Added detailed README.md documentation

Major Enhancement: URL Resolution Integration

Problem: Matches were being missed when a Twitter username differed from the Telegram username, even when the Twitter profile explicitly linked to the Telegram account.

Solution: Integrated url_resolution_queue table data into candidate finding.

Implementation:

  • Added new method find_by_resolved_url() to TwitterMatcher class (find_twitter_candidates.py:339-378)
  • Queries Twitter profiles where bio URLs (t.co/xyz) resolve to t.me/{username}
  • Integrated as Method 5b in candidate finding pipeline (find_twitter_candidates.py:554-574)
  • Baseline confidence: 0.95 (very high - explicit link in bio)
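
The changelog does not show the body of find_by_resolved_url(), so the following is only a minimal sketch of the idea: take a resolved URL, extract a Telegram handle from it if it points at t.me/{username}, and emit a candidate at the 0.95 baseline. The helper names (telegram_handle_from_url, candidate_from_resolved_url), the dictionary layout, and the username pattern are assumptions made for illustration; only the method label and the confidence value come from this changelog.

```python
import re
from typing import Optional

# Assumption: a handle is a run of letters, digits, and underscores directly
# after the host, with nothing else in the path (so invite links are skipped).
_TME_RE = re.compile(
    r"^https?://(?:www\.)?t\.me/([A-Za-z0-9_]+)/?(?:\?.*)?$",
    re.IGNORECASE,
)

# Baseline confidence stated in the changelog for this method.
RESOLVED_URL_CONFIDENCE = 0.95


def telegram_handle_from_url(resolved_url: str) -> Optional[str]:
    """Return the lowercased Telegram handle, or None if the URL is not a profile link."""
    match = _TME_RE.match(resolved_url.strip())
    return match.group(1).lower() if match else None


def candidate_from_resolved_url(
    twitter_username: str, original_url: str, resolved_url: str
) -> Optional[dict]:
    """Build a hypothetical candidate record for one resolved bio URL."""
    handle = telegram_handle_from_url(resolved_url)
    if handle is None:
        return None
    return {
        "twitter_username": twitter_username,
        "telegram_username": handle,
        "method": "twitter_bio_url_resolves_to_telegram",
        "confidence": RESOLVED_URL_CONFIDENCE,
        "original_url": original_url,
        "resolved_url": resolved_url,
    }


candidate = candidate_from_resolved_url(
    "Block_Flash", "https://t.co/dc3iztSG9B", "https://t.me/bull_flash"
)
print(candidate)
```

In the real module the resolved URLs come from the url_resolution_queue table rather than from function arguments, so the lookup is presumably done in SQL.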

Impact:

  • 140+ potential new matches identified
  • Captures matches like Twitter @Block_Flash → Telegram @bull_flash
  • Especially valuable when the usernames differ but the user explicitly links the two profiles

Example Match:

Twitter: @Block_Flash (ID: 63429948)
Telegram: @bull_flash
Method: twitter_bio_url_resolves_to_telegram
Confidence: 0.95
Resolved URL: https://t.me/bull_flash
Original URL: https://t.co/dc3iztSG9B

URL Resolution Data Source

The url_resolution_queue table contains:

  • 16,133 Twitter profiles with resolved Telegram URLs
  • Shortened URLs from Twitter bios (t.co/xyz)
  • Resolved destinations (full URLs)
  • Extracted Telegram handles (JSONB array)

Files Modified

  1. find_twitter_candidates.py

    • Added find_by_resolved_url() method
    • Integrated into find_candidates_for_contact() as Method 5b
    • Fixed the join between twitter_id (VARCHAR) and user.id (BIGINT) by adding an explicit type cast

  2. main.py (ProfileMatching module)

    • Created comprehensive interactive menu
    • Real-time statistics display
    • Streamlined workflow for candidate finding and LLM verification

  3. README.md

    • Complete documentation of all 10 matching methods
    • Usage examples and performance metrics
    • Configuration and troubleshooting guides

  4. UnifiedContacts/main.py

    • Added option 15: "Open Profile Matching System (Interactive Menu)"
    • Reorganized menu to separate ProfileMatching from individual Twitter steps
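
The type-casting fix listed under find_twitter_candidates.py can be illustrated with a small stand-in. The sketch below uses Python's built-in sqlite3 module rather than the project's actual database, and the table names (twitter_user, url_resolution) are hypothetical. It shows only the shape of the join: an ID stored as text on one side is cast explicitly before being compared with an integer key on the other. SQLite is lenient about such comparisons, whereas a strictly typed database would reject the join without the cast. Which side the real query casts is not stated in this changelog.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical stand-ins for the real tables.
    CREATE TABLE twitter_user (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE url_resolution (twitter_id TEXT, resolved_url TEXT);

    INSERT INTO twitter_user VALUES (63429948, 'Block_Flash');
    INSERT INTO url_resolution VALUES ('63429948', 'https://t.me/bull_flash');
    """
)

# Cast the text ID to an integer so both sides of the join share a type.
rows = conn.execute(
    """
    SELECT u.username, r.resolved_url
    FROM url_resolution AS r
    JOIN twitter_user AS u
      ON u.id = CAST(r.twitter_id AS INTEGER)
    """
).fetchall()

print(rows)
conn.close()
```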

Current Statistics (as of 2025-11-04)

Candidates:

  • Users with candidates: 38,121
  • Total candidates found: 253,117
  • Processed by LLM: 253,085
  • Pending verification: 32

Verified Matches:

  • Users with matches: 25,662
  • Total matches: 36,147
  • Average confidence: 0.74
  • High confidence (90%+): 12,031
  • Medium confidence (80-89%): 1,452
  • Low confidence (70-79%): 7,505
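
The tier boundaries above can be expressed as a small helper. This is a sketch under the assumption that tiers are assigned by simple thresholds on a 0-1 confidence score; the function name and the "below_threshold" label are hypothetical. Note that the three listed tiers sum to 20,988 of the 36,147 total matches, so the remainder presumably falls below 0.70, which is an inference and not something the statistics state.

```python
from collections import Counter


def confidence_tier(confidence: float) -> str:
    """Map a 0-1 confidence score to the tiers used in the statistics."""
    if confidence >= 0.90:
        return "high"
    if confidence >= 0.80:
        return "medium"
    if confidence >= 0.70:
        return "low"
    return "below_threshold"  # hypothetical label for everything under 70%


# Example scores, invented purely for illustration.
scores = [0.95, 0.92, 0.85, 0.74, 0.70, 0.55]
tier_counts = Counter(confidence_tier(s) for s in scores)
print(dict(tier_counts))
```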

LLM Verification (V6 Prompt)

Current prompt improvements:

  • Upfront directive for comparative evaluation
  • Clear signal strength hierarchy (Very Strong, Strong Supporting, Weak, Red Flags)
  • Company vs personal account differentiation
  • Streamlined from ~135 to ~90 lines while improving clarity
  • Emphasis on evaluating ALL candidates together

Performance Metrics

Candidate Finding (Threaded):

  • Speed: ~1.5 contacts/sec
  • Time for 43K contacts: ~16-18 hours
  • Workers: 8 (default)
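
As a rough illustration of the threaded approach, the sketch below fans contacts out to a pool of 8 workers using the standard library. It is not the code in find_twitter_candidates_threaded.py: the find_candidates_for_contact stub here is a stand-in with an assumed signature, and the real function queries the database.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List

DEFAULT_WORKERS = 8  # default worker count stated in the changelog


def find_candidates_for_contact(contact_id: int) -> List[dict]:
    """Stand-in for the real matcher; returns one dummy candidate per contact."""
    return [{"contact_id": contact_id, "method": "exact_username", "confidence": 0.90}]


def find_candidates_threaded(
    contact_ids: Iterable[int], workers: int = DEFAULT_WORKERS
) -> Dict[int, List[dict]]:
    """Run the matcher for each contact on a thread pool, keyed by contact ID."""
    ids = list(contact_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with ids.
        results = list(pool.map(find_candidates_for_contact, ids))
    return dict(zip(ids, results))


all_candidates = find_candidates_threaded([101, 102, 103])
print(all_candidates)
```

Threads suit this workload if most of the time is spent waiting on database queries; that is an assumption about the matcher, not something measured here.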

LLM Verification (Async):

  • Speed: ~32 users/minute (100 concurrent requests)
  • Cost: ~$0.003 per user (GPT-5-mini)
  • Time for 43K users: ~23 hours
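
The figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in this section; the variable names are illustrative.

```python
CONTACTS = 43_000

# Candidate finding at the quoted sustained rate.
candidate_rate_per_sec = 1.5
candidate_hours = CONTACTS / candidate_rate_per_sec / 3600

# Rate implied by the quoted 16-18 hour range (using the 17-hour midpoint).
implied_rate_per_sec = CONTACTS / (17 * 3600)

# LLM verification at the quoted rate and per-user cost.
llm_rate_per_min = 32
llm_hours = CONTACTS / llm_rate_per_min / 60
llm_cost_usd = CONTACTS * 0.003

print(f"Candidate finding at 1.5/sec: {candidate_hours:.1f} h")
print(f"Rate implied by 17 h:         {implied_rate_per_sec:.2f} contacts/sec")
print(f"LLM verification:             {llm_hours:.1f} h, ${llm_cost_usd:.0f}")
```

The LLM numbers agree with the quoted ~23 hours and ~$130. For candidate finding, 1.5 contacts/sec sustained would finish 43K contacts in about 8 hours, while 16-18 hours corresponds to roughly 0.7 contacts/sec. The changelog does not say which figure reflects end-to-end throughput, so the quoted rate may be a peak value.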

Module Structure

ProfileMatching/
├── main.py                              # Interactive menu system
├── README.md                            # Complete documentation
├── CHANGELOG.md                         # This file
├── find_twitter_candidates.py           # Core matching logic (10 methods)
├── find_twitter_candidates_threaded.py  # Threaded implementation
├── verify_twitter_matches_v2.py         # LLM verification (V6 prompt)
├── review_match_quality.py              # Analysis tools
└── setup_twitter_matching_schema.sql    # Database schema

10 Matching Methods Summary

  1. Phash Match (0.95/0.88) - Profile picture similarity
  2. Exact Bio Handle (0.95) - Twitter handle extracted from Telegram bio
  3. Bio URL Resolution (0.95) NEW - Shortened URL resolves to Telegram
  4. Twitter Bio Has Telegram (0.92) - Twitter bio mentions Telegram username
  5. Display Name Containment (0.92) - Telegram display name contained in Twitter display name
  6. Exact Username (0.90) - Usernames match exactly
  7. TG Username in Twitter Name (0.88)
  8. Twitter Username in TG Name (0.86)
  9. Fuzzy Name (0.65-0.85) - Trigram similarity
  10. Username Variation (0.80) - Generated username variations
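
To make the fuzzy-name method more concrete, here is a minimal pure-Python sketch of trigram similarity: each name is broken into overlapping three-character pieces, and the score is the share of pieces the two names have in common. The padding and word-splitting rules below are a simplification chosen for illustration. The changelog does not say which trigram implementation the module uses or how a score maps onto the 0.65-0.85 confidence range.

```python
import re
from typing import Set


def trigrams(text: str) -> Set[str]:
    """Return the set of padded, lowercased trigrams for every word in text."""
    grams: Set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing (assumed convention)
        grams.update(padded[i : i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(trigram_similarity("Block Flash", "Bull Flash"))
print(trigram_similarity("Andrew Jiang", "andrew jiang"))
```

Identical names (ignoring case) score 1.0 and names with no shared trigrams score 0.0, so a threshold somewhere in between decides which pairs become candidates.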

Testing

All changes tested with:

  • Standalone method testing (find_by_resolved_url)
  • Full integration testing (find_candidates_for_contact)
  • Verified deduplication works correctly
  • Confirmed matches with different usernames are captured
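
The deduplication check mentioned above matters because one Twitter profile can be found by several methods at once. The changelog does not describe how the module deduplicates, so the sketch below assumes one plausible rule: keep a single row per Twitter ID, choosing the highest-confidence method. The function name and record layout are hypothetical.

```python
from typing import Dict, List


def deduplicate_candidates(candidates: List[dict]) -> List[dict]:
    """Keep the highest-confidence candidate per Twitter ID, best first."""
    best: Dict[str, dict] = {}
    for cand in candidates:
        key = cand["twitter_id"]
        if key not in best or cand["confidence"] > best[key]["confidence"]:
            best[key] = cand
    return sorted(best.values(), key=lambda c: c["confidence"], reverse=True)


found = [
    {"twitter_id": "63429948", "method": "fuzzy_name", "confidence": 0.70},
    {"twitter_id": "63429948", "method": "twitter_bio_url_resolves_to_telegram", "confidence": 0.95},
    {"twitter_id": "11111111", "method": "exact_username", "confidence": 0.90},
]
unique = deduplicate_candidates(found)
print(unique)
```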

Next Steps

Potential future enhancements:

  • Add more matching methods (location, bio keywords, mutual connections)
  • Implement feedback loop for prompt improvement
  • Add manual review interface for borderline matches
  • Export matches to various formats
  • Additional URL resolution sources beyond Twitter bios

Migration Notes

For existing deployments:

  1. No database schema changes required
  2. Existing url_resolution_queue table is used as-is
  3. Scripts in scripts/ folder remain unchanged and functional
  4. The new ProfileMatching module is additive and does not break existing workflows

To use the new features, do one of the following:

  1. Use ProfileMatching/main.py instead of the individual scripts
  2. Run the scripts directly from the ProfileMatching folder
  3. Update import paths to use the ProfileMatching module