In our data-centric world, efficiently searching and retrieving information is crucial. Traditional search methods frequently fall short with the surge of unstructured data. Oracle Database 23ai addresses this challenge with a groundbreaking feature: vector search. This innovative tool transforms data handling and retrieval, particularly with complex, high-dimensional datasets. In this blog post, we’ll explore the concept of vector search, its implementation in Oracle Database 23ai, and its transformative impact across various industries.

Understanding Vector Search

Vector search, also known as similarity search or nearest neighbor search, involves searching for data points in a vector space. Unlike traditional search methods that rely on exact keyword matches, vector search leverages mathematical representations of data (vectors) to find similarities. Each data point is encoded into a vector, capturing its essential features. The search process then identifies vectors that are closest to the query vector based on a chosen distance metric, such as Euclidean distance or cosine similarity.
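To make the two distance metrics concrete, here is a minimal pure-Python sketch (an illustration only, not Oracle code) of Euclidean distance and cosine similarity between two small example vectors:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance: square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 1.0, 1.0]
print(euclidean_distance(v1, v2))           # 1.0
print(round(cosine_similarity(v1, v2), 3))  # 0.816
```

Real embedding vectors have hundreds of dimensions (384 for the model used later in this post), but the arithmetic is exactly the same.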

Oracle Database 23ai integrates vector search capabilities, providing users with an advanced tool for data retrieval. Here’s a closer look at how it works and its features:

Seamless Integration: Oracle Database 23ai’s vector search is built into the database, allowing users to perform vector searches without external tools or complex workflows.

High Performance: Leveraging Oracle’s robust infrastructure, vector search in 23ai offers high-speed searches even on large datasets. Advanced indexing and optimized algorithms ensure quick and accurate retrieval.

Multi-Modal Data Support: Oracle Database 23ai supports various data types, including text, images, audio, and more. This versatility makes it a powerful tool for applications across different domains.

Customizable Distance Metrics: Users can choose from a variety of distance metrics based on their specific needs, enhancing the flexibility and accuracy of the search results.

The Vector Search Flow

Data Ingestion: Data is ingested into the Oracle Database 23ai database, where it is preprocessed and converted into vectors. This step uses machine learning models to encode the data’s features into numerical vectors.

Indexing: The vectors are indexed to facilitate efficient searching. Oracle Database 23ai uses advanced indexing techniques, such as Hierarchical Navigable Small World (HNSW) graphs, to enable fast and scalable searches.

Querying: When a query is made, it is also converted into a vector. The vector search algorithm then identifies the closest vectors in the database using the chosen distance metric.

Results: The search results are returned, showcasing the most similar data points to the query. These results can be further refined or analyzed based on the application’s requirements.
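The four steps above can be sketched as a toy, brute-force nearest-neighbor search in plain Python. The tiny 2-D "embeddings" below are made-up values for illustration; a real system would produce them with an embedding model such as the ONNX model used later in this post:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# Ingestion: each word is stored with its (made-up) embedding vector
database = {
    "sun":   [0.9, 0.1],
    "light": [0.8, 0.3],
    "seed":  [0.1, 0.9],
    "grow":  [0.2, 0.8],
}

def search(query_vector, top_k=2):
    # Querying: rank every stored vector by distance to the query vector
    ranked = sorted(database.items(),
                    key=lambda item: cosine_distance(item[1], query_vector))
    # Results: return the top_k closest words
    return [word for word, _ in ranked[:top_k]]

print(search([0.85, 0.2]))  # ['sun', 'light']
```

A real index such as HNSW avoids this exhaustive scan, but the ranking logic is the same.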

A classic example is image search and classification according to given criteria.

To achieve this, Oracle can take advantage of existing ONNX (Open Neural Network Exchange) models. ONNX offers a standardized format for representing deep learning models.

Pretrained ONNX models are freely available at https://github.com/onnx/models

Building the test environment

I use an OCI VM running Oracle Linux Server 9.4.

I will use a Python utility to generate the ONNX model.

The first step is to create a Python virtual environment and install the required packages; here is the pip freeze output:

(venv) [oracle@db23-134956 ~]$ pip freeze
certifi==2024.6.2
cffi==1.16.0
charset-normalizer==3.3.2
coloredlogs==15.0.1
contourpy==1.2.1
cryptography==42.0.8
cycler==0.12.1
filelock==3.15.4
flatbuffers==24.3.25
fonttools==4.53.0
fsspec==2024.6.1
huggingface-hub==0.23.4
humanfriendly==10.0
idna==3.7
importlib_resources==6.4.0
Jinja2==3.1.4
joblib==1.4.2
kiwisolver==1.4.5
MarkupSafe==2.1.5
matplotlib==3.9.0
mpmath==1.3.0
networkx==3.3
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.82
nvidia-nvtx-cu12==12.1.105
oml @ file:///home/oracle/installer/client/oml-2.0-cp312-cp312-linux_x86_64.whl#sha256=0c1f7c83256f60c87f1f66b2894098bc8fefd8a60a03e67ba873661dd178b3c2
onnx==1.16.1
onnxruntime==1.18.1
onnxruntime_extensions==0.11.0
oracledb==2.2.1
packaging==24.1
pandas==2.2.2
pillow==10.4.0
protobuf==5.27.2
pycparser==2.22
pyparsing==3.1.2
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.13.1
sentencepiece==0.2.0
setuptools==70.2.0
six==1.16.0
sympy==1.12.1
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.3.1
tqdm==4.66.4
transformers==4.42.3
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
zipp==3.19.2

I also use the Oracle Database 23ai Free Docker image.

To start the container I use the following command, placing the datafiles and archive logs on external mount points:

(venv) [oracle@db23-134956 ~]$ docker run  -it --name 23ai -p 1700:1521 \
-v /u02/data/DB:/opt/oracle/oradata \
-v /u02/reco/DB:/opt/oracle/reco \
-e ENABLE_ARCHIVELOG=true  \
container-registry.oracle.com/database/free

Let’s get the embedding model and build the ONNX file for our usage:

(venv) [oracle@db23-134956 ~]$ python
Python 3.12.1 (main, Feb 19 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3.0.1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from oml.utils import EmbeddingModel, EmbeddingModelConfig
>>> em = EmbeddingModel(model_name="sentence-transformers/all-MiniLM-L6-v2")
>>> em.export2file("all-MiniLM-L6-v2", output_dir=".")
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 350/350 [00:00<00:00, 4.01MB/s]
vocab.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 1.15MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 112/112 [00:00<00:00, 1.52MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 1.55MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 612/612 [00:00<00:00, 8.23MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90.9M/90.9M [00:00<00:00, 304MB/s]
>>> exit()
(venv) [oracle@db23-134956 ~]$ ls
all-MiniLM-L6-v2.onnx  venv

The generated ONNX file is all-MiniLM-L6-v2.onnx.

Finally, I copy the generated ONNX file to a shared directory so that it can be imported into the database:

(venv) [oracle@db23-134956 ~]$ cp all-MiniLM-L6-v2.onnx /u02/data/DB/FREE/FREEPDB1/dump/1AC27018093909F8E063020011AC7FF6/

Let’s connect to the database and load the ONNX model into it:

(venv) [oracle@db23-134956 ~]$ sqlplus system/oracle@localhost:1700/freepdb1

SQL*Plus: Release 23.0.0.0.0 - Production on Wed Jul 3 09:12:02 2024
Version 23.4.0.24.05

SQL> create or replace directory DATA_PUMP_DIR as '/opt/oracle/oradata/FREE/FREEPDB1/dump';

Directory created.

SQL> begin
  2  dbms_vector.load_onnx_model(
  3  directory => 'DATA_PUMP_DIR'
  4  , file_name => 'all-MiniLM-L6-v2.onnx'
  5  , model_name => 'all_minilm_l6_v2'
  6  , metadata => json('{"function" : "embedding", "embeddingOutput" : "embedding" , "input": {"input": ["DATA"]}}')
  7  );
  8  end;
  9  /

PL/SQL procedure successfully completed.

-- test the loaded model 
SQL> select
  model_name
  , mining_function
  , algorithm
  , (model_size/1024/1024) as model_size_mb
from user_mining_models
order by model_name;

MODEL_NAME         MINING_FUNCTION    ALGORITHM    MODEL_SIZE_MB
------------------ ------------------ ------------ -------------
ALL_MINILM_L6_V2   EMBEDDING          ONNX            86.4376068

The test table contains the word, its description, and the generated vector:

-- create the table  
SQL> create table my_dictionary(
 id number generated always as identity
  , word varchar2(100) not null
  , description varchar2(500) not null
  , word_vector vector not null
   );

Table created.

-- create the trigger to generate the vector 
SQL> create or replace trigger my_dict_vect_build
  2  before insert or update on my_dictionary
  3  for each row
  4  declare
  5  begin
  6    :new.word_vector := dbms_vector_chain.utl_to_embedding(
  7      data => :new.word
  8      , params => json(q'[{"provider": "database", "model": "&model_name."}]')
  9    );
 10  end;
 11  /
Enter value for model_name: all_minilm_l6_v2
old   8:     , params => json(q'[{"provider": "database", "model": "&model_name."}]')
new   8:     , params => json(q'[{"provider": "database", "model": "all_minilm_l6_v2"}]')

Trigger created.

-- create an index on the table
SQL> create vector index my_dict_ivf_idx
  on my_dictionary(word_vector)
  organization neighbor partitions
  distance cosine
  with target accuracy 95;

Index created.
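The NEIGHBOR PARTITIONS organization is an IVF-style (inverted file) index: vectors are clustered into partitions around centroids, and an approximate query probes only the partitions closest to the query vector instead of scanning the whole table, which is where the target-accuracy trade-off comes from. Here is a rough pure-Python sketch of the idea, with hand-picked centroids standing in for the real clustering step done at index build time:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-picked centroids standing in for the clustering done at index build time
centroids = [[1.0, 0.0], [0.0, 1.0]]
partitions = {0: [], 1: []}

def ingest(word, vector):
    # Each vector goes into the partition of its nearest centroid
    pid = min(range(len(centroids)), key=lambda i: euclidean(vector, centroids[i]))
    partitions[pid].append((word, vector))

for word, vec in [("sun", [0.9, 0.1]), ("light", [0.8, 0.3]),
                  ("seed", [0.1, 0.9]), ("grow", [0.2, 0.8])]:
    ingest(word, vec)

def approx_search(query, top_k=1):
    # Probe only the single nearest partition -- the "approximate" part:
    # a match hiding in another partition would be missed
    pid = min(range(len(centroids)), key=lambda i: euclidean(query, centroids[i]))
    ranked = sorted(partitions[pid], key=lambda wv: euclidean(wv[1], query))
    return [word for word, _ in ranked[:top_k]]

print(approx_search([0.9, 0.15]))  # ['sun']
```

Raising the target accuracy essentially means probing more partitions, trading speed for recall.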

I downloaded a dictionary in CSV format (word | description) from https://www.bragitoff.com/2016/03/english-dictionary-in-csv-format and formatted it to look like this:

Score;q'[To engrave, as upon a shield.]'
Score;q'[To make a score of, as points, runs, etc., in a game.]'
Score;q'[To write down in proper order and arrangement  as, to score an overture for an orchestra. See Score, n., 9.]'
Score;q'[To mark with parallel lines or scratches  as, the rocks of New England and the Western States were scored in the drift epoch.]'
Scorer;q'[One who, or that which, scores.]'

Then I loaded the dictionary into a table dict using sqlldr:

SQL> desc dict;
 Name                Null?    Type
 ------------------- -------- ----------------
 WORD                         VARCHAR2(100)
 DESCRIPTION         NOT NULL VARCHAR2(500)

Using these parameter files:

[oracle@0c97e3a7be16 ]$ cat ld.par
control=ld.ctl
log=ld.log
bad=ld.bad
data=ld.csv
direct=true
errors=1000000

[oracle@0c97e3a7be16 ]$ cat ld.ctl 
load data into table dict
insert
fields terminated by ";"
(word,description
)
[oracle@0c97e3a7be16 1AC27018093909F8E063020011AC7FF6]$  sqlldr 'system/"Hello-World-123"@freepdb1' parfile=ld.par

SQL*Loader: Release 23.0.0.0.0 - Production on Tue Jul 9 09:47:59 2024
Version 23.4.0.24.05

Copyright (c) 1982, 2024, Oracle and/or its affiliates.  All rights reserved.

Path used:      Direct

Load completed - logical record count 54555.

Table DICT:
  19080 Rows successfully loaded.

Check the log file:
  ld.log
for more information about the load.

SQL> select count(*) from dict;

  COUNT(*)
----------
     19080

Finally, insert into the final table my_dictionary; the trigger will build the vector information:

SQL> insert into my_dictionary(word,DESCRIPTION)  select WORD,DESCRIPTION from dict;

19080 rows created.

SQL> commit;

Commit complete.

Let’s test the miraculous AI.

The ONNX model used was designed for word search. Obviously, the more complex the model, the more precise the search, and the more expensive it is.

The question is: what are the first x words from my dictionary that are closest to a given word? As my dictionary contains duplicate words, I used DISTINCT to filter the final result.
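In Python terms, the DISTINCT over the ranked candidates is just a de-duplication of the word list (the sample words below are illustrative). A small order-preserving sketch; note that SQL itself does not guarantee the output order of the outer DISTINCT:

```python
def distinct_preserving_order(ranked_words):
    # Keep the first (best-ranked) occurrence of each word, drop later repeats
    seen = set()
    result = []
    for word in ranked_words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

# Illustrative ranked candidates containing duplicate words
candidates = ["Radiant", "Light", "Radiant", "Heat", "Light", "Radiate"]
print(distinct_preserving_order(candidates))  # ['Radiant', 'Light', 'Heat', 'Radiate']
```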

Let’s search the result for the word “Sun”:

SQL> define search_term="Sun";

select distinct word from (
with subject as (
  select to_vector(
  vector_embedding(&model_name. using '&search_term.' as data)
  ) as search_vector
  )
  select o.word from my_dictionary o, subject s order by cosine_distance(
    o.word_vector, s.search_vector)
  fetch approx first 50 rows only with target accuracy 80);
old   4:   vector_embedding(&model_name. using '&search_term.' as data)
new   4:   vector_embedding(all_minilm_l6_v2 using 'Sun' as data)

WORD
--------------------------------------------------------------------------------
Radiant
Radiate
Light
Radiancy
Earth shine
Heat

6 rows selected.

Test for the word “Seed”:

SQL> define search_term="Seed";
SQL> /
old   4:   vector_embedding(&model_name. using '&search_term.' as data)
new   4:   vector_embedding(all_minilm_l6_v2 using 'Seed' as data)

WORD
--------------------------------------------------------------------------------
Germinate
Germination
Feeder
Grow
Germinal
Heartseed
Grass
Fertile
Fertilization
Grass-grown
Harvest
Fruit
Grower
Packfong
Pack
Agnation
Gusset

17 rows selected.

The results are pretty cool.

Conclusion

AI is here, everyone is using it, and we are going to use it more and more. Artificial Intelligence (AI) has rapidly evolved from a theoretical concept into a transformative force across numerous industries, redefining how we interact with technology and fundamentally altering societal structures.

This example is limited and the ONNX model is simple, but it shows how easily we can implement a solution that projects us into the AI world.

