
# Twitter-Telegram Profile Matching System

A comprehensive system for finding and verifying Twitter-Telegram profile matches using multiple matching methods and LLM-based verification.

## Overview

This system operates in two main steps:

1. **Candidate Finding**: Discovers potential Twitter profiles that match Telegram contacts using 10 different matching methods
2. **LLM Verification**: Uses GPT-5-mini to evaluate candidates and assign confidence scores; only matches scoring 0.70 or higher are saved

## Quick Start

```bash
cd /Users/andrewjiang/Bao/TimeToLockIn/Profile/UnifiedContacts/ProfileMatching
python3.10 main.py
```

## Matching Methods

The system uses 10 different methods to find Twitter candidates:

### High Confidence Methods (0.88-0.95)

1. **Phash Match** (0.95 for exact, 0.88 for distance=1)
   - Compares profile picture hashes
   - Pre-computed in `telegram_twitter_phash_matches` table

2. **Exact Bio Handle** (0.95)
   - Extracts Twitter handles from Telegram bio
   - Patterns: `@username`, `twitter.com/username`, `x.com/username`

3. **Bio URL Resolution** (0.95) ⭐ NEW
   - Twitter bio contains shortened URL (t.co/xyz) that resolves to `t.me/username`
   - Queries `url_resolution_queue` table
   - Captures matches even when usernames differ

4. **Twitter Bio Has Telegram** (0.92)
   - Reverse lookup: Twitter bio mentions Telegram username
   - Searches for `@username`, `t.me/username`, `telegram.me/username`

5. **Display Name Containment** (0.92)
   - Telegram name contained within Twitter display name

6. **Exact Username** (0.90)
   - Telegram username exactly matches Twitter username
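
The handle patterns from method 2 and the phash scoring from method 1 could be sketched roughly as follows. This is a minimal illustration, not the code in `find_twitter_candidates.py`: the function names, the regexes, the list of reserved URL paths, and the assumption that hashes are stored as hex strings are all hypothetical.

```python
import re
from typing import List, Optional

# Twitter usernames are 1-15 characters of letters, digits, and underscores.
_HANDLE = r"([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])"
_URL_PATTERN = re.compile(
    r"(?<![A-Za-z0-9-])(?:www\.|mobile\.)?(?:twitter|x)\.com/@?" + _HANDLE,
    re.IGNORECASE,
)
# The lookbehind skips e-mail addresses and "@" inside other URLs.
_AT_PATTERN = re.compile(r"(?<![A-Za-z0-9_.@/])@" + _HANDLE)
# Assumed list of URL paths that are not profiles.
_RESERVED_PATHS = {"home", "i", "intent", "search", "share"}


def extract_twitter_handles(bio: str) -> List[str]:
    """Return unique, lowercased handles in the order they appear in the bio."""
    found = []
    for match in _URL_PATTERN.finditer(bio):
        handle = match.group(1).lower()
        if handle not in _RESERVED_PATHS:
            found.append((match.start(), handle))
    for match in _AT_PATTERN.finditer(bio):
        found.append((match.start(), match.group(1).lower()))

    handles: List[str] = []
    for _, handle in sorted(found):
        if handle not in handles:
            handles.append(handle)
    return handles


def phash_confidence(hash_a: str, hash_b: str) -> Optional[float]:
    """Map the Hamming distance between two hex-encoded hashes to a baseline."""
    distance = bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")
    if distance == 0:
        return 0.95  # exact match
    if distance == 1:
        return 0.88  # one differing bit
    return None  # larger distances are not treated as a phash match here
```

Note that a bare `@username` in a Telegram bio may refer to another platform, which is one reason every candidate still goes through LLM verification.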

### Medium Confidence Methods (0.65-0.88)

7. **TG Username in Twitter Name** (0.88)
8. **Twitter Username in TG Name** (0.86)
9. **Fuzzy Name** (0.65-0.85)
   - PostgreSQL trigram similarity with 0.65 threshold
10. **Username Variation** (0.80)
    - Generates variations (remove underscores, flip numbers, etc.)
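
To build intuition for the 0.65 fuzzy-name threshold, the sketch below approximates PostgreSQL's trigram similarity in pure Python. It is not the query the system runs: the real comparison happens in the database, and this approximation (function names included) is only an illustration that may differ from `pg_trgm` on edge cases such as locale-specific characters.

```python
import re
from typing import Set

FUZZY_THRESHOLD = 0.65  # threshold quoted for the Fuzzy Name method


def trigrams(text: str) -> Set[str]:
    """Lowercase, split into alphanumeric words, pad each word, take 3-grams."""
    grams: Set[str] = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (Jaccard index)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_fuzzy_candidate(telegram_name: str, twitter_name: str) -> bool:
    """True when the two names clear the fuzzy-name threshold."""
    return trigram_similarity(telegram_name, twitter_name) >= FUZZY_THRESHOLD
```

Under this approximation, "John Smith" and "john smyth" share 8 of 14 distinct trigrams (about 0.57), so a single changed letter in a short name can already fall below the threshold.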

## LLM Verification

The system uses GPT-5-mini with the V6 prompt, which:

- Evaluates ALL candidates together (comparative evaluation)
- Applies differential scoring (only one can be "most likely")
- Distinguishes between personal and company accounts
- Considers signal strength holistically
- Only saves matches with 70%+ confidence
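
The 70% cutoff can be illustrated with a small post-processing step. This is a sketch under assumptions: `ScoredCandidate` and `select_matches` are hypothetical names, and the actual response format produced by the V6 prompt is not shown here.

```python
from dataclasses import dataclass
from typing import List

SAVE_THRESHOLD = 0.70  # only matches at or above this confidence are saved


@dataclass
class ScoredCandidate:
    """Hypothetical shape of one LLM-scored candidate."""
    twitter_id: str
    confidence: float
    reasoning: str


def select_matches(
    scored: List[ScoredCandidate], threshold: float = SAVE_THRESHOLD
) -> List[ScoredCandidate]:
    """Keep candidates that clear the threshold, strongest first."""
    kept = [c for c in scored if c.confidence >= threshold]
    return sorted(kept, key=lambda c: c.confidence, reverse=True)
```

Because all candidates for a user are scored in a single comparative pass, the first element of the returned list corresponds to the "most likely" candidate.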

## Files

### Core Scripts

- `main.py` - Interactive menu for running the system
- `find_twitter_candidates.py` - Core matching logic (TwitterMatcher class)
- `find_twitter_candidates_threaded.py` - Threaded implementation (RECOMMENDED)
- `verify_twitter_matches_v2.py` - LLM verification with async (RECOMMENDED)
- `review_match_quality.py` - Analyze match quality and statistics

### Database Schema

- `setup_twitter_matching_schema.sql` - Database tables and indexes

## Database Tables

### `twitter_match_candidates`
Stores all potential matches found by the matching methods.

**Key fields:**
- `telegram_user_id` - Telegram contact user ID
- `twitter_id` - Twitter profile ID
- `match_method` - Which method found this candidate
- `baseline_confidence` - Initial confidence (0.0-1.0)
- `match_signals` - JSON with match details
- `llm_processed` - Whether LLM has evaluated this candidate

### `twitter_telegram_matches`
Stores verified matches (70%+ confidence from LLM).

**Key fields:**
- `telegram_user_id` - Telegram contact
- `twitter_id` - Matched Twitter profile
- `final_confidence` - LLM-assigned confidence (0.70-1.0)
- `llm_verdict` - LLM reasoning
- `match_method` - Original matching method
- `matched_at` - Timestamp

### `url_resolution_queue`
Maps shortened URLs in Twitter bios to resolved URLs (including Telegram links).

**Key fields:**
- `twitter_id` - Twitter profile ID
- `original_url` - Shortened URL (e.g., t.co/abc)
- `resolved_url` - Full URL (e.g., https://t.me/username)
- `telegram_handles` - Extracted Telegram handles (JSONB array)
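
As an illustration of how a `resolved_url` might be turned into an entry for `telegram_handles`, here is a minimal sketch. The function name, the reserved-path list, and the username pattern are assumptions, not the project's actual extraction code.

```python
import re
from typing import Optional
from urllib.parse import urlparse

_TELEGRAM_HOSTS = {"t.me", "telegram.me"}
# Assumed non-profile paths on Telegram links.
_RESERVED_PATHS = {"joinchat", "addstickers", "share", "proxy", "socks"}
# Telegram usernames are typically 5-32 characters and start with a letter;
# this pattern is deliberately a little lenient on the minimum length.
_USERNAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]{3,31}$")


def telegram_handle_from_url(resolved_url: str) -> Optional[str]:
    """Return the lowercased Telegram username in a URL, or None."""
    if "://" not in resolved_url:
        resolved_url = "https://" + resolved_url  # tolerate scheme-less URLs
    parsed = urlparse(resolved_url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in _TELEGRAM_HOSTS:
        return None

    first_segment = parsed.path.strip("/").split("/")[0]
    if first_segment.lower() in _RESERVED_PATHS:
        return None
    if not _USERNAME.match(first_segment):
        return None  # invite links such as t.me/+abc end up here
    return first_segment.lower()
```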

## Usage Examples

### Find Candidates for All Contacts (Threaded)
```bash
python3.10 find_twitter_candidates_threaded.py --workers 8
```

### Find Candidates for First 1000 Contacts
```bash
python3.10 find_twitter_candidates_threaded.py --limit 1000 --workers 8
```

### Verify Matches with LLM (100 concurrent requests)
```bash
python3.10 verify_twitter_matches_v2.py --verbose --concurrent 100
```

### Test Mode (50 users, 10 concurrent)
```bash
python3.10 verify_twitter_matches_v2.py --test --limit 50 --verbose --concurrent 10
```

### Review Match Quality
```bash
python3.10 review_match_quality.py
```

## Performance

### Candidate Finding (Threaded)
- **Speed**: ~1.5 contacts/sec
- **Time for 43K contacts**: ~16-18 hours
- **Workers**: 8 (default, configurable)
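
The worker setting maps naturally onto a thread pool. The sketch below shows the general pattern only; `find_candidates_for_contact` is a hypothetical stand-in for the per-contact matching work, and the real threaded script may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, Iterable, List


def find_candidates_for_contact(contact_id: int) -> List[dict]:
    """Hypothetical stand-in: the real work would query the databases."""
    return [{"telegram_user_id": contact_id, "match_method": "stub"}]


def run_threaded(contact_ids: Iterable[int], workers: int = 8) -> Dict[int, List[dict]]:
    """Process contacts in parallel and collect candidates per contact."""
    results: Dict[int, List[dict]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(find_candidates_for_contact, cid): cid
            for cid in contact_ids
        }
        for future in as_completed(futures):
            cid = futures[future]
            try:
                results[cid] = future.result()
            except Exception as exc:  # one failing contact should not stop the run
                print(f"contact {cid} failed: {exc}")
                results[cid] = []
    return results
```

Threads help here because the work is dominated by database round trips rather than CPU time.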

### LLM Verification (Async)
- **Speed**: ~32 users/minute with 100 concurrent requests
- **Cost**: ~$0.003 per user (GPT-5-mini)
- **Time for 43K users**: ~23 hours
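
For planning a run, the figures above can be plugged into a small estimator. The helper names are hypothetical, and the inputs should be replaced with rates measured on your own data.

```python
def estimate_hours(items: int, rate_per_second: float) -> float:
    """Hours needed to process `items` at a sustained per-second rate."""
    if rate_per_second <= 0:
        raise ValueError("rate_per_second must be positive")
    return items / rate_per_second / 3600


def estimate_cost(users: int, cost_per_user: float) -> float:
    """Total API cost in dollars for verifying `users` users."""
    return users * cost_per_user


if __name__ == "__main__":
    contacts = 43_000
    print(f"verification: {estimate_hours(contacts, 32 / 60):.1f} h")
    print(f"verification cost: ${estimate_cost(contacts, 0.003):.0f}")
    print(f"candidate finding at 1.5/s: {estimate_hours(contacts, 1.5):.1f} h")
```

At 32 users/minute, 43K users take roughly 22.4 hours, in line with the ~23 hours quoted, at about $129 in API cost. A sustained 1.5 contacts/sec would finish 43K contacts in about 8 hours, so the 16-18 hour figure suggests an average throughput closer to 0.7 contacts/sec over a full run; measure before planning around either number.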

## Recent Improvements

### V6 Prompt (Latest)
- Upfront directive for comparative evaluation
- Clear signal strength hierarchy
- Company vs personal account differentiation
- Streamlined from ~135 to ~90 lines while being clearer

### URL Resolution Integration
- Added Bio URL Resolution (method 3 in the list above, introduced as Method 5b)
- Captures 140+ additional matches
- Especially valuable when usernames differ
- 0.95 baseline confidence (very high)

## Configuration

Environment variables (in `/Users/andrewjiang/Bao/TimeToLockIn/Profile/.env`):
```
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-5-mini
```
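
A minimal sketch of reading these settings with the standard library, failing early when the key is missing. `LLMConfig` and `load_llm_config` are hypothetical names, the `gpt-5-mini` fallback is an assumption, and the sketch expects the variables to already be exported into the environment; how the scripts actually load the `.env` file is not shown here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMConfig:
    api_key: str
    model: str


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> LLMConfig:
    """Read the OpenAI settings, rejecting a missing or placeholder key."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key or api_key == "your_key_here":
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumed default: fall back to the model named in this README.
    model = env.get("OPENAI_MODEL", "gpt-5-mini").strip()
    return LLMConfig(api_key=api_key, model=model)
```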

Database connections:
- `telegram_contacts` - Telegram contact data
- `twitter_data` - Twitter profile data

## Tips

1. **Always run threaded candidate finding** - 10-20x faster than single-threaded
2. **Use high concurrency for LLM verification** - 100+ concurrent requests for optimal speed
3. **Monitor costs** - Check OpenAI usage during verification
4. **Review match quality periodically** - Use `review_match_quality.py` to analyze results
5. **Test first** - Use `--test --limit 50` flags before full runs

## Troubleshooting

### LLM verification is slow
- Increase `--concurrent` parameter (try 100-200)
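- Check your account's OpenAI rate limits (1,000 RPM for Tier 1); limits vary by model and usage tier, and concurrency beyond the limit only produces rate-limit errors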
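
The effect of `--concurrent` can be pictured as a semaphore that caps the number of in-flight requests. The sketch below shows that pattern with a hypothetical `verify_user` stub in place of the real LLM call; it is not the implementation in `verify_twitter_matches_v2.py`.

```python
import asyncio
from typing import Iterable, List


async def verify_user(user_id: int) -> dict:
    """Hypothetical stand-in for one LLM verification request."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"telegram_user_id": user_id, "verified": True}


async def verify_all(user_ids: Iterable[int], concurrent: int = 100) -> List[dict]:
    """Verify all users with at most `concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(concurrent)

    async def bounded(user_id: int) -> dict:
        async with semaphore:
            return await verify_user(user_id)

    # gather returns results in the same order as the input IDs
    return await asyncio.gather(*(bounded(uid) for uid in user_ids))


if __name__ == "__main__":
    results = asyncio.run(verify_all(range(50), concurrent=10))
    print(len(results), "users verified")
```

Raising the cap only helps until the API rate limit becomes the bottleneck, so it is worth watching for rate-limit errors when increasing it.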

### Many low-quality matches
- Review and adjust V6 prompt in `verify_twitter_matches_v2.py`
- Check `review_match_quality.py` for insights

### Missing obvious matches
- Check whether a candidate was found by querying `twitter_match_candidates`
- If no candidate was found, a new matching method may be needed
- If a candidate was found but not verified, check the LLM reasoning in `llm_verdict`
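
These checks could be scripted along the following lines. The function name and status labels are hypothetical; the sketch uses only the columns listed under Database Tables and assumes `conn` is a DB-API connection (for example from `psycopg2`) to whichever database holds both tables. The `placeholder` argument exists only so the same code can be tried against other drivers.

```python
def diagnose_missing_match(conn, telegram_user_id, placeholder="%s"):
    """Report how far a Telegram user got through the matching pipeline."""
    cur = conn.cursor()

    # Step 1: was any candidate found for this user?
    cur.execute(
        "SELECT twitter_id, match_method, baseline_confidence, llm_processed "
        "FROM twitter_match_candidates "
        f"WHERE telegram_user_id = {placeholder} "
        "ORDER BY baseline_confidence DESC",
        (telegram_user_id,),
    )
    candidates = cur.fetchall()

    # Step 2: did any candidate survive LLM verification?
    cur.execute(
        "SELECT twitter_id, final_confidence, llm_verdict "
        "FROM twitter_telegram_matches "
        f"WHERE telegram_user_id = {placeholder}",
        (telegram_user_id,),
    )
    matches = cur.fetchall()

    if matches:
        status = "verified"
    elif not candidates:
        status = "no_candidates"  # a new matching method may be needed
    elif any(not row[3] for row in candidates):
        status = "pending_llm"  # some candidates have not been evaluated yet
    else:
        status = "processed_not_saved"  # evaluated but below the 70% cutoff
    return {"status": status, "candidates": candidates, "matches": matches}
```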

## Future Enhancements

- Add more matching methods (location, bio keywords, etc.)
- Implement feedback loop for prompt improvement
- Add manual review interface for borderline matches
- Export matches to various formats