Cherry-picked from #1029
## Description of changes
*Summarize the changes made by this PR.*
- Improvements & Bug fixes
- Added support for `$in` and `$nin` metadata filters
> Note: See CIP in `docs/` or example notebook for more info
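
As a rough illustration of the filter shapes, here is a minimal sketch of how `$in` and `$nin` conditions might be evaluated against a single metadata record. The `matches` helper is hypothetical and is not Chroma's implementation; in particular, treating a missing key as satisfying `$nin` is an assumption made for this example only.

```python
from typing import Any, Dict, List

Metadata = Dict[str, Any]
Where = Dict[str, Dict[str, List[Any]]]


def matches(metadata: Metadata, where: Where) -> bool:
    """Toy evaluator (hypothetical) for where filters that use $in / $nin."""
    for key, condition in where.items():
        for op, candidates in condition.items():
            if op == "$in":
                # The key must be present and its value must be listed.
                if key not in metadata or metadata[key] not in candidates:
                    return False
            elif op == "$nin":
                # Assumption: a record without the key passes a $nin filter.
                if key in metadata and metadata[key] in candidates:
                    return False
            else:
                raise ValueError(f"unsupported operator: {op}")
    return True


records = [{"color": "red"}, {"color": "blue"}, {"size": 3}]
in_filter = {"color": {"$in": ["red", "green"]}}
nin_filter = {"color": {"$nin": ["red", "green"]}}
in_hits = [r for r in records if matches(r, in_filter)]
nin_hits = [r for r in records if matches(r, nin_filter)]
```

The filter dictionaries above have the same shape a caller would pass as a `where` argument; the CIP remains the reference for the actual semantics.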
## Test plan
*How are these changes tested?*
- [x] Tests pass locally with `pytest` for python
## Documentation Changes
TBD
---------
Co-authored-by: Hammad Bashir <HammadB@users.noreply.github.com>
Note: Cherry-picked from original PR #1010
Refs: 1009
## Description of changes
*Summarize the changes made by this PR.*
- Improvements & Bug fixes
- Fixed an issue where the collection's persistent segment dir was not
removed; thus, many dirs with segment data were left on the device.
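
The kind of cleanup described above can be sketched as follows. This is a sketch only: it assumes each segment persists its data in a directory named after the segment id under a common root, and both `delete_segment_dir` and that layout are hypothetical rather than taken from this PR.

```python
import shutil
from pathlib import Path


def delete_segment_dir(persist_root: Path, segment_id: str) -> bool:
    """Remove a segment's on-disk directory; return True if one was removed.

    Hypothetical helper: assumes the layout <persist_root>/<segment_id>/.
    """
    segment_dir = persist_root / segment_id
    if segment_dir.is_dir():
        # rmtree removes the directory and everything stored inside it.
        shutil.rmtree(segment_dir)
        return True
    return False
```

Returning a boolean keeps the call idempotent, so removing an already-removed segment is not an error.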
## Test plan
*How are these changes tested?*
- [x] Tests pass locally with `pytest` for python
## Documentation Changes
N/A
## Description of changes
*Summarize the changes made by this PR.*
- Improvements & Bug fixes
- Similar to #958, this makes the read path of the metadata segment use
the index properly, giving a >10x speedup in query performance when
using get(). Previously the read path did a full table scan. Since we
want $contains to support substring matches, it must use the trigram
tokenizer, which is the only tokenizer that supports LIKE-based
substring matching. See https://sqlite.org/fts5.html#the_trigram_tokenizer
- Adds a migration to use the trigram tokenizer.
- Adds validation to disallow an empty where document.
- Changes the FTS index to rely on rowid for uniqueness. Instead of
deleting speculatively on the write path, we rely on integrity checks.
- Removes LIKE escaping, since it's not supported by FTS and degrades to
full table scans.
- Updates tests to treat _ and % as wildcards.
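
A minimal sketch of LIKE-based substring matching against an FTS5 table that uses the trigram tokenizer. The table name `docs` and the helpers are assumptions for illustration, not the schema from this PR. The trigram tokenizer needs SQLite 3.34 or newer, so the sketch falls back to a plain table when it is unavailable; the query results are the same either way, only the query plan differs.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_docs(rows: Iterable[Tuple[int, str]]) -> sqlite3.Connection:
    """Create an in-memory `docs` table (hypothetical name) and load rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE VIRTUAL TABLE docs USING fts5(string_value, tokenize='trigram')"
        )
    except sqlite3.OperationalError:
        # Older SQLite builds lack FTS5 or the trigram tokenizer.
        conn.execute("CREATE TABLE docs (string_value TEXT)")
    conn.executemany("INSERT INTO docs (rowid, string_value) VALUES (?, ?)", rows)
    return conn


def contains(conn: sqlite3.Connection, needle: str) -> List[int]:
    """Return rowids whose text contains `needle`.

    The needle is deliberately not escaped, so % and _ act as LIKE wildcards.
    """
    cur = conn.execute(
        "SELECT rowid FROM docs WHERE string_value LIKE ? ORDER BY rowid",
        ("%" + needle + "%",),
    )
    return [row[0] for row in cur]


conn = build_docs(
    [
        (1, "The quick brown fox"),
        (2, "a_c literal underscore"),
        (3, "abc plain"),
        (4, "lazy dog"),
    ]
)
```

With these rows, `contains(conn, "a_c")` returns both row 2 and row 3, because the unescaped `_` matches any single character.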
## Test plan
Existing tests
## Documentation Changes
Clarify that where filtering will ignore _ and %.
## Description of changes
Previously we were not using the FTS search index correctly.
https://sqlite.org/fts5.html#full_text_query_syntax expects you to
query using the name of the FTS table, not the column name. If you want
to query by column name, you have to use column filters, as discussed in
the link above. We take the path suggested in
https://sqlite.org/forum/forumpost/1d45a7f6e17a3460 and match on id in
addition to filtering on that specific column. The query planner
leverages this appropriately, as confirmed with EXPLAIN.
Because we ran speculative delete queries on the assumption that the
index was being used, writes were incredibly slow. They are now much faster.
Explain Before
`-- SCAN VIRTUAL TABLE INDEX 0:` -> Full table scan.
Explain After
`-- SCAN VIRTUAL TABLE INDEX 0:M2` -> Scans the index itself.
The net effect is a large increase in write speed; the write path time
also no longer grows with table size.
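
The query form described above, with the FTS table name on the left of MATCH and a column filter inside the query string, can be sketched as follows. The table name `docs_fts`, its columns, and the helper names are assumptions for illustration and are not the schema from this PR.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_index(rows: Iterable[Tuple[str, str]]) -> sqlite3.Connection:
    """Create a small in-memory FTS5 table (hypothetical schema) and load rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE docs_fts USING fts5(id UNINDEXED, string_value)"
    )
    conn.executemany(
        "INSERT INTO docs_fts (id, string_value) VALUES (?, ?)", rows
    )
    return conn


def search(conn: sqlite3.Connection, term: str) -> List[str]:
    """Full-text search restricted to the string_value column."""
    # The table name goes on the left of MATCH; the column filter
    # `string_value : "..."` limits matching to that one column.
    phrase = '"' + term.replace('"', '""') + '"'
    cur = conn.execute(
        "SELECT id FROM docs_fts WHERE docs_fts MATCH ? ORDER BY id",
        ("string_value : " + phrase,),
    )
    return [row[0] for row in cur]


conn = build_index(
    [("a", "hello world"), ("b", "goodbye world"), ("c", "hello again")]
)
```

Running `EXPLAIN QUERY PLAN` on the same statement is one way to check whether the virtual table index is used, as was done for the Explain output above.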
### Quick Benchmark Results
N = 100k uniformly random vectors
D = 128
Metadata = one small key: value pair
Document = randomly generated string of length 100
Added with batch size = 1000
**Without Fix, Overall Time = 469s. Time to add a batch grows linearly
to >8000 ms**
<img width="590" alt="Screenshot 2023-08-09 at 5 53 24 PM"
src="https://github.com/chroma-core/chroma/assets/5598697/89dde745-9231-4f3f-b62c-bf8486f7e970">
**With Fix, Overall Time = 102s. Time to add a batch grows sublinearly
to ~1200 ms**
<img width="587" alt="Screenshot 2023-08-09 at 5 43 12 PM"
src="https://github.com/chroma-core/chroma/assets/5598697/2a771788-e5d9-4afe-bacb-dfbfb51b6cd1">
We will also want to make sure that the read path leverages this way of
querying. That will be addressed in a follow-up PR.
## Test plan
Existing tests cover the scope of this change.
## Documentation Changes
None required.
## Description of changes
This diff adds support for metadata filtering on boolean values.
*Summarize the changes made by this PR.*
- Improvements & Bug fixes
- Filtering on boolean metadata values was previously broken; this diff
fixes it.
- New functionality
- ...
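
One common way boolean handling goes wrong in Python is that `bool` is a subclass of `int`, so an `isinstance(value, int)` check also accepts `True` and `False`. Whether that was the cause of the bug fixed here is not stated in this PR; the sketch below only illustrates the pitfall, and the `value_column` helper and column names are hypothetical.

```python
from typing import Union

MetadataValue = Union[str, int, float, bool]


def value_column(value: MetadataValue) -> str:
    """Pick a typed column name (hypothetical names) for a metadata value."""
    # bool must be checked before int: isinstance(True, int) is True.
    if isinstance(value, bool):
        return "bool_value"
    if isinstance(value, int):
        return "int_value"
    if isinstance(value, float):
        return "float_value"
    if isinstance(value, str):
        return "string_value"
    raise TypeError(f"unsupported metadata type: {type(value).__name__}")
```

Swapping the first two checks would silently route `True` to the integer column, which is the kind of mistake a property test over mixed metadata types can catch.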
## Test plan
*How are these changes tested?*
- Enhance unit tests in test_metadata.py
- Enhance property test for test_filtering.py
## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*
- There is no public API change in this diff
---------
Co-authored-by: Liquan Pei <liquanpei@Liquans-MacBook-Pro.local>
## Description of changes
New API implementation backed by the segment-based architecture. Should
be extensible to a full distributed architecture.
---------
Co-authored-by: Jeffrey Huber <jeff@trychroma.com>
Co-authored-by: hammadb <hammad@trychroma.com>
Co-authored-by: Anton Troynikov <atroyn@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Hammad Bashir <HammadB@users.noreply.github.com>
## Description of changes
Follows the strategy described in the Message ID Serialization ADR.
## Test plan
Includes Hypothesis tests.
## Documentation Changes
N/A
---------
Co-authored-by: Jeffrey Huber <jeff@trychroma.com>
Co-authored-by: hammadb <hammad@trychroma.com>