
Getting Started With Vector Databases

CONTENTS

•  About Vector Databases
•  Key Concepts of Vector Databases
   −  Embeddings and Dimensions
   −  Distance Metrics and Similarity
   −  Vector Indexes
   −  Scalability
•  Use Cases
•  Getting Started
•  Conclusion

MIGUEL GARCÍA LORENZO
VP OF ENGINEERING, NEXTAIL

Vector databases are specialized databases designed for scenarios where understanding the context, similarity, or pattern is more important than matching exact values. They leverage the mathematics of vectors and the principles of geometry to understand and organize data, capabilities that are essential to boosting the power of analytical and generative artificial intelligence (AI).

Figure 1: Vector database overview

The explosion of AI and machine learning (ML) technologies is the key driver behind the rapid growth of vector databases in the last two years, providing greater value via performance, agility, and cost.

Unlike other evolutions in databases, vector databases were not made to replace any technology but to solve new cases for which there was no existing technological alternative. The main purpose of this Refcard is to provide a clear and accessible overview of vector databases, outlining their importance, applications, and underlying principles. In addition, we will use a functional example throughout to better demonstrate key points and objectives.

ABOUT VECTOR DATABASES

A vector database is a specialized database for storing, searching, and managing information as vectors, which are the numerical representation of objects in a high-dimensional space (e.g., documents, text, images, videos, audio) that captures certain features of the object itself.

This numerical representation is called a vector embedding, or simply embedding, which we will dive into in more detail later on.


Vector embeddings are created using ML models that are able to translate the semantic and qualitative value of the object into a numerical representation. There are a variety of ML models for each data type, such as text, audio, image, and other embedding models.

The use of a vector database is not a mandatory requirement to generate or use vector embeddings; there are many vector index libraries focused on storing embeddings with in-memory indexes. However, vector databases are highly recommended for enterprise architectures, production environments, and when working with high concurrency and data volume.

Nowadays, vector databases are designed to support the association of that embedding with the object metadata, which can include a variety of information such as the object's structured definition. Having this information alongside vectors enables more sophisticated querying, filtering, and management capabilities that are similar to the queries made in traditional databases. This makes vector databases more integrable, versatile, and interpretable with end users and within data architectures.

Figure 2: Metadata

Vector databases are a complete system designed to manage embeddings at scale. Here are the key differentiators and advantages of using vector databases:

•  Persistence and durability: Allow data to be stored on disk as well as in memory and provide fault-tolerant features like data replication or regular backups.
•  High availability and reliability: Operate continuously and provide tolerance to failures and errors based on clustering and data replication architectures.
•  Scalability: Scale horizontally across multiple nodes.
•  Optimized performance and cost effectiveness: Handle and organize data through high-dimensional vectors that can contain thousands of dimensions.
•  Support complex queries and APIs: Enable complex queries that combine vector similarity searches with traditional database queries (see the sketch at the end of this section).
•  Security and access control: Contain built-in security features, such as authentication and authorization, data encryption, data isolation, and access control mechanisms, that are essential for enterprise applications and compliance with data protection regulations.
•  Seamless integration and SDKs: Integrate seamlessly with existing data ecosystems, providing integration libraries for several programming languages, a variety of APIs (e.g., GraphQL, RESTful), and integrations with Apache Kafka.
•  Support for CRUD operations: Vector databases allow you to add, update, and delete objects with their vectors, so users don't have to reindex the entire database when any underlying data changes.

TRADITIONAL RELATIONAL vs. VECTOR DATABASE
Traditional or relational databases are indispensable for applications requiring structured and semi-structured data that will return the exact match to the query. These databases store the information in rows or documents, where each row or document is a record that provides structured information such as product attributes or customer details.

Vector databases, on the other hand, are optimized for storing and searching through high-dimensional vector data that will return items based on similarity metrics rather than exact matches.

Figure 3: Differences between traditional and vector databases
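To make the combined-query capability above concrete, here is a minimal sketch that mixes a vector similarity search with a traditional attribute filter. It is not part of the original example: it assumes the Weaviate Python client (v4) and the "Products" collection built in the Getting Started section at the end of this Refcard, including its "section" property.

import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()
try:
    products = client.collections.get("Products")

    # Vector similarity on the query text, restricted by a structured
    # metadata filter (section == "MEN"), much like a SQL WHERE clause
    response = products.query.near_text(
        query="red cotton t-shirt",
        filters=wvc.query.Filter.by_property("section").equal("MEN"),
        limit=3,
        return_properties=["name", "family", "color"],
    )

    for obj in response.objects:
        print(obj.properties)
finally:
    client.close()

The metadata filter prunes candidates using the stored object attributes, while the ranking itself is still driven by vector similarity.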


KEY CONCEPTS OF VECTOR DATABASES
Using vector databases involves understanding their fundamental concepts: embeddings, indexes, and distance and similarity.

EMBEDDINGS AND DIMENSIONS
As we explained previously, embeddings are numerical representations of objects that capture their semantic meaning and relationships in a high-dimensional space, including semantic relationships, contextual usage, or features. This numerical representation is composed of an array of numbers in which each element corresponds to a specific dimension.

Figure 4: Embedding representation

The number of dimensions in embeddings is important because each dimension corresponds to a feature that we capture from the object. It is represented as a numerical and quantitative value, and it also defines the dimensional map where each object will be located.

Let's consider a simple example with a numerical representation of words, where the words are the definition of each fashion retail product stored in our transaction database. Imagine if we could capture the essence of these products with only two dimensions.

Figure 5: Array of embeddings

In Figure 6, we can see the dimensional representation of these objects to visualize their similarity. T-shirts are closer because both are the same product with different colors. The jacket is closer to t-shirts because they share attributes like sleeves and a collar. Furthest to the right are the jeans, which don't share attributes with the other products.

Figure 6: Dimensional map

Obviously, with two dimensions, we cannot capture the essence of the products. Dimensionality plays a crucial role in how well these embeddings can capture the relevant features of the products. More dimensions may provide more accuracy but also require more resources in terms of compute, memory, latency, and cost.

VECTOR EMBEDDING MODELS INTEGRATION
Some vector databases provide seamless integration with embedding models, allowing us to generate vector embeddings from raw data and seamlessly integrate ML models into database operations. This feature simplifies the development process and abstracts away the complexities involved in generating and using vector embeddings for both data insertion and querying processes.

Figure 7: Embeddings generation patterns
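As a minimal, self-contained illustration of how an embedding model turns product text into vectors, the sketch below uses the sentence-transformers library to encode two product descriptions from the data sample used later in this Refcard. The library choice and the printed dimension count are assumptions of this example; the model name matches the inference container used in the Getting Started section.

from sentence_transformers import SentenceTransformer

# Any text embedding model works; this one matches the t2v-transformers
# image used later in this Refcard (it produces 384-dimensional vectors).
model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

descriptions = [
    "Relaxed Fit Tee, Men, T-shirts, 100% cotton, Crewneck, Short sleeves, Red",
    "Trucker Jacket, Men, Jackets, 100% cotton, Denim, Point collar, Long sleeves, Gray",
]

embeddings = model.encode(descriptions)

print(embeddings.shape)       # one row per product, e.g., (2, 384)
print(embeddings[0][:5])      # first five dimensions of the tee's embedding

Each position in a row is one dimension, a learned feature of the product description; the full array is the embedding that a vector database would store and search.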


Table 1: Embedding generation comparative

EXAMPLES | WITHOUT INTEGRATION | WITH MODEL INTEGRATIONS
Data ingestion | 1. Before we can insert each object, we must call our model to generate a vector embedding. 2. Then we can insert our data with the vector. | We can insert each object directly in the vector database, delegating the transformation to the database.
Query | 1. Before we run a query, we must call our model to generate a vector embedding from our query first. 2. Then we can run a query with that vector. | We can run a query directly in the vector database, delegating the transformation to the database.

DISTANCE METRICS AND SIMILARITY

Distance metrics are mathematical measures and functions used to determine the distance (similarity) between two elements in a vector space. In the context of embeddings, distance metrics evaluate how far apart two embeddings are. A similarity search retrieves the embeddings that are similar to a given input based on a distance metric; this input can be a vector embedding, text, or another object. There are several distance metrics. The most popular ones are the following.

COSINE SIMILARITY
Cosine similarity measures the cosine of the angle between two vector embeddings, and it's often used as a distance metric in text analysis and other domains where the magnitude of the vector is less important than the direction.

Figure 8: Cosine

EUCLIDEAN DISTANCE
Euclidean distance measures the straight-line distance between two points in Euclidean space.

Figure 9: Euclidean

MANHATTAN DISTANCE
Manhattan distance (L1 norm) sums the absolute differences of their coordinates.

Figure 10: Manhattan

The choice of distance metric and similarity measure has a profound impact on the behavior and performance of ML models; however, the recommendation is to use the same distance metric as the metric used to train the given model.
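As a quick, self-contained illustration of the three metrics above, outside any database, the following NumPy sketch compares two small example vectors; the vectors themselves are made up for this example.

import numpy as np

a = np.array([0.9, 0.1, 0.3])   # e.g., embedding of "red t-shirt"
b = np.array([0.8, 0.2, 0.5])   # e.g., embedding of "green t-shirt"

# Cosine similarity: direction matters, magnitude does not
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance (L2): straight-line distance between the points
euclidean = np.linalg.norm(a - b)

# Manhattan distance (L1): sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))

print(cosine, euclidean, manhattan)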

VECTOR INDEXES
Vector indexes are specialized data structures designed to efficiently store, organize, and query high-dimensional vector embeddings. These indexes provide fast search queries in a cost-effective way. There are several indexing strategies that are optimized for handling the complexity and scale of the vector space. Some examples include:

•  Approximate nearest neighbor (ANN)
•  Inverted index
•  Locality-sensitive hashing (LSH)

Generally, each database implements a subset of these index strategies, and in some cases, they are customized for better performance.
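To give a feel for how an ANN index behaves, here is a small sketch using the hnswlib library, one of the in-memory vector index libraries mentioned earlier. The library and the random placeholder vectors are assumptions of this example, not something used elsewhere in this Refcard.

import hnswlib
import numpy as np

dim = 384                                    # must match the embedding model's output size
vectors = np.random.rand(100, dim).astype(np.float32)   # stand-in for real embeddings
ids = np.arange(100)

# Build an HNSW-based approximate nearest neighbor index using cosine distance
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=100, ef_construction=200, M=16)
index.add_items(vectors, ids)

# Query: the three vectors most similar to the first one
labels, distances = index.knn_query(vectors[0], k=3)
print(labels, distances)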

SCALABILITY
Vector databases are usually highly scalable solutions that support vertical and horizontal scaling. Horizontal scaling is based on two fundamental strategies: sharding and replication. Both strategies are crucial for managing large-scale and distributed databases.

SHARDING
Sharding involves dividing a database into smaller, more manageable pieces called shards. Each shard contains a subset of the database's data, making it responsible for a particular segment of the data.

Table 2: Key sharding advantages and considerations

ADVANTAGES | CONSIDERATIONS
By distributing the data across multiple servers, sharding can reduce the load on any single server, leading to improved performance. | Implementing sharding can be complex, especially in terms of data distribution, shard management, and query processing across shards.
Sharding allows a database to scale by adding more shards across additional servers, effectively handling more data and users without degradation in performance. | Ensuring even distribution of data and avoiding hotspots where one shard receives significantly more queries than others can be challenging.
It can be more cost effective to add more servers with moderate specifications than to scale up a single server with high specifications. | Query throughput does not improve when adding more sharded nodes.


REPLICATION
Replication involves creating copies of a database on multiple nodes within the cluster.

Table 3: Key advantages and considerations for replication

ADVANTAGES | CONSIDERATIONS
Replication ensures that the database remains available for read operations even if some servers are down. | Maintaining data consistency across replicas, especially in write-heavy environments, can be challenging and may require sophisticated synchronization mechanisms.
Replication provides a mechanism for disaster recovery as data is backed up across multiple locations. | Replication requires additional storage and network resources, as data is duplicated across multiple servers.
Replication can improve the read scalability of a database system by allowing read queries to be distributed across multiple replicas. | In asynchronous replication setups, there can be a lag between when data is written to the primary index and when it is replicated to the secondary indexes. This lag can impact applications that require real-time or near-real-time data consistency across replicas.

USE CASES
Vector databases and embeddings are crucial for several key use cases, including semantic search, vector data in generative AI, and more.

SEMANTIC SEARCH
You can retrieve information by leveraging the capabilities of vector embeddings to understand and match the semantic context of queries with relevant content.

Searches are performed by calculating the similarity between the query vector and document vectors in the database, using some of the previously explained metrics, such as cosine similarity. Some of the applications would be:

•  Recommendation systems: Perform similarity searches to find items that match a user's interests, providing accurate and timely recommendations to enhance the user experience.
•  Customer support: Obtain the most relevant information to solve customers' doubts, questions, or problems.
•  Knowledge management: Find relevant information quickly from the organization's knowledge base composed of documents, slides, videos, or reports in enterprise systems.

VECTOR DATA IN GENERATIVE AI: RETRIEVAL-AUGMENTED GENERATION
Generative AI and large language models (LLMs) have certain limitations given that they must be trained with a large amount of data. This training imposes high costs in terms of time, resources, and money. As a result, these models are usually trained with general contexts and are not constantly updated with the latest information.

Retrieval-augmented generation (RAG) plays a crucial role because it was developed to improve response quality in specific contexts using a technique that incorporates an external source of relevant and updated information into the generative process. A vector database is particularly well suited for implementing RAG models due to its unique capabilities in handling high-dimensional data, performing efficient similarity searches, and integrating seamlessly with AI/ML workflows.

Figure 11: Overview of RAG architecture

Using vector databases in the RAG integration pattern has the following advantages:

•  Semantic understanding: Vector embeddings capture the nuanced semantic relationships within data, whether text, images, or audio. This deep understanding is essential for generative models to produce high-quality, realistic outputs that are contextually relevant to the input or prompt.
•  Dimensionality reduction: By representing complex data in a lower-dimensional vector space, vast datasets are reduced to a form that AI models can feasibly process and learn from.
•  Quality and precision: The precision of similarity search in vector databases ensures that the information retrieved for generation is of high relevance and quality.
•  Seamless integration: Vector databases provide APIs, SDKs, and tools that make it easy to integrate with various AI/ML frameworks. This flexibility facilitates the development and deployment of RAG models, allowing researchers and developers to focus on model optimization rather than data management challenges.
•  Context generation: Vector embeddings capture the semantic essence of text, images, videos, and more, enabling AI models to understand context and generate new content that is contextually similar or related.
•  Scalability: Vector databases provide a scalable solution that can manage large-scale information without compromising retrieval performance.
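The following is a minimal sketch of the retrieval and augmentation steps in a RAG flow, reusing the "Products" collection built in the Getting Started section. The prompt template and the generate_answer call are placeholders for whichever LLM client you use; they are assumptions of this example, not part of the Refcard's exercise.

import weaviate

client = weaviate.connect_to_local()
try:
    products = client.collections.get("Products")

    question = "Which jackets do you have for men?"

    # 1. Retrieve: similarity search for the most relevant objects
    results = products.query.near_text(query=question, limit=3)
    context = "\n".join(str(o.properties) for o in results.objects)

    # 2. Augment: inject the retrieved context into the prompt
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

    # 3. Generate: pass the prompt to your LLM of choice
    # answer = generate_answer(prompt)   # hypothetical LLM call
    print(prompt)
finally:
    client.close()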


Vector databases provide the technological foundation necessary for the effective implementation of RAG models and make them an optimal choice for interaction with large-scale knowledge bases.

OTHER SPECIFIC USE CASES
Beyond the main use cases discussed above are several others, such as:

•  Anomaly detection: Embeddings capture nuanced relationships and patterns within data, making it possible to detect anomalies that might not be evident through traditional methods.
•  Retail comparable products: By converting product features into vector embeddings, retailers can quickly find products with similar characteristics (e.g., design, material, price, sales).

GETTING STARTED
To get started, we have conducted a practical exercise below that demonstrates the use of a vector database for identifying comparable products in a fashion retail scenario (i.e., semantic search use case). We'll go through setting up the environment, loading fashion product data into the open-source vector database, and querying it to find similar items.

For the environment, ensure the following tools are installed:

•  Docker 24 or higher
•  Docker Compose v2
•  Python 3.8 or higher

DATA SAMPLE
The following is a list of the datasets that we will use during this practical exercise, based on the concepts explained in previous sections:

Table 4: Data sample

NAME | SECTION | FAMILY | FIT | COMPOSITION | COLOR
Relaxed Fit Tee | Men | T-shirts | Non-stretch, Relaxed fit | 100% cotton. Jersey. Crewneck, Short sleeves | Red
Relaxed Fit Tee | Men | T-shirts | Non-stretch, Relaxed fit | 100% cotton. Jersey. Crewneck, Short sleeves | Green
Trucker Jacket | Men | Jackets | Standard fit | 100% cotton, Denim, Point collar, Long sleeves | Gray
Slim Welt Pocket Jeans | Women | Jeans | Mid rise: 8 3/4'', Inseam: 30'', Leg opening: 13'' | 62% cotton, 28% viscose (ECOVERO™), 8% elastomultiester, 2% elastane, Denim, Stretch, Zip fly, 5-pocket styling | Black
Baggy Dad Utility Pants | Women | Jeans | Mid rise, Straight leg | 95% cotton, 5% recycled cotton, Denim, No Stretch | Green
The Perfect Tee | Women | T-shirts | Standard fit, Model wears a size small | 100% cotton, Crewneck, Short sleeves | White
Lelou Shrunken Moto Jacket | Women | Jackets | Slim fit | 100% polyurethane - releases plastic microfibers into the environment during washing, Long sleeves | Black

STEP 1: START UP YOUR VECTOR DATABASE
In this example, we are going to use the following Docker Compose file to locally run our vector database instance, using the open-source Weaviate vector database in the following configuration:

---
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.24.4
    ports:
    - 8080:8080
    - 50051:50051
    volumes:
    - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      ENABLE_MODULES: 'text2vec-transformers'
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: '0'
volumes:
  weaviate_data:
...


In this example, the most relevant part is the modules' configuration:

•  DEFAULT_VECTORIZER_MODULE is the vectorization module, which transforms objects into embeddings by default (otherwise, you need to provide a vector for each data point that you add manually).
•  TRANSFORMERS_INFERENCE_API is the location of the inference API. In our case, we are running this service in another image defined in the Docker Compose file.
•  ENABLE_MODULES lists the modules enabled inside Weaviate. We are going to use text2vec-transformers to vectorize the products' data objects.
•  t2v-transformers is the image that runs the text2vec-transformers inference service.

Once we create the Docker Compose file, all we have to do is execute it:

# Docker Compose runs two images: the Weaviate database and the
# t2v-transformers inference service
$ sudo docker compose up -d

To check if our vector database is running, we will run the following commands:

# Check if the containers' status is up.
$ sudo docker ps
CONTAINER ID   …   STATUS
16dbc16744a8   …   Up 2 minutes
cb4175cec9a2   …   Up 2 minutes

# Check database status by querying the API
$ curl -X GET http://localhost:8080/v1/meta

# In case of error, check the logs
$ docker compose logs -f --tail 100 weaviate

STEP 2: INSTALL THE CLIENT LIBRARY
Next, install the Weaviate Python client:

$ pip install weaviate-client

STEP 3: PREPARING YOUR FASHION RETAIL DATA
Prepare a dataset of fashion retail products based on Table 4. Each product should have attributes like name, description, or composition.

products_data = [
    {
        "name": "Relaxed Fit Tee",
        "section": "MEN",
        "family": "T-SHIRTS",
        "fit": "Non-stretch, Relaxed fit",
        "composition": "100% cotton. Jersey. Crewneck, Short sleeves",
        "color": "Red"
    },
    {
        "name": "Relaxed Fit Tee",
        "section": "MEN",
        "family": "T-SHIRTS",
        "fit": "Non-stretch, Relaxed fit",
        "composition": "100% cotton. Jersey. Crewneck, Short sleeves",
        "color": "Green"
    },
    {
        "name": "TRUCKER JACKET",
        "section": "MEN",
        "family": "JACKETS",
        "fit": "Standard fit",
        "composition": "100% cotton, Denim, Point collar, Long sleeves",
        "color": "Gray"
    },
    {
        "name": "SLIM WELT POCKET JEANS",
        "section": "WOMEN",
        "family": "JEANS",
        "fit": "Mid rise: 8 3/4'', Inseam: 30'', Leg opening: 13''",
        "composition": "62% cotton, 28% viscose (ECOVERO™), 8% elastomultiester, 2% elastane, Denim, Stretch, Zip fly, 5-pocket styling",
        "color": "Black"
    },
    {
        "name": "BAGGY DAD UTILITY PANTS",
        "section": "WOMEN",
        "family": "JEANS",
        "fit": "Mid rise, Straight leg",
        "composition": "95% cotton, 5% recycled cotton, Denim, No Stretch",
        "color": "Green"
    },
    {
        "name": "THE PERFECT TEE",
        "section": "WOMEN",
        "family": "T-SHIRTS",
        "fit": "Standard fit, Model wears a size small",
        "composition": "100% cotton, Crewneck, Short sleeves",
        "color": "White"
    },
    {
        "name": "LELOU SHRUNKEN MOTO JACKET",
        "section": "WOMEN",
        "family": "JACKETS",
        "fit": "Slim fit",
        "composition": "100% polyurethane - releases plastic microfibers into the environment during washing, Long sleeves",
        "color": "Black"
    }
]

STEP 4: CREATE A COLLECTION
To create a collection, we need to define the collection and schema for the products' data objects. There are two options here:

1. Create a schema that includes these properties
2. Let your vector database auto-detect and generate the properties automatically
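For reference, a minimal sketch of the first option (an explicit schema) might look like the following, assuming the Weaviate Python client v4 installed in Step 2; the property names mirror Table 4.

import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()
try:
    # Explicitly declare each property instead of relying on auto-detection
    client.collections.create(
        name="Products",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_transformers(),
        properties=[
            wvc.config.Property(name="name", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="section", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="family", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="fit", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="composition", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="color", data_type=wvc.config.DataType.TEXT),
        ],
    )
finally:
    client.close()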


In this case, we are going to use the second option, using Weaviate as our example:

import weaviate
import weaviate.classes as wvc

# Defined previously in Step 3
products_data = [{....}]

# Connect with default parameters
client = weaviate.connect_to_local()

# Check if the connection was successful
try:
    client.is_ready()
    print("Successfully connected to Weaviate.")

    products_collection = client.collections.create(
        name="Products",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=True
        )
    )

    products_objs = list()
    for i, d in enumerate(products_data):
        products_objs.append({
            "name": d["name"],
            "section": d["section"],
            "family": d["family"],
            "fit": d["fit"],
            "composition": d["composition"],
            "color": d["color"],
        })

    products_collection.data.insert_many(products_objs)
finally:
    client.close()

STEP 5: SIMILARITY QUERY
Once your data is indexed, we can query for similar products using Weaviate's vector search capabilities. For example, to find products similar to a "Red T-Shirt" or "Jeans for women," you can use a search query with its description:

import weaviate
import weaviate.classes as wvc

# Connect with default parameters
client = weaviate.connect_to_local()

# Check if the connection was successful
try:
    client.is_ready()
    print("Successfully connected to Weaviate.")

    products = client.collections.get("Products")

    response = products.query.near_text(
        query="Red T-Shirt",
        return_metadata=wvc.query.MetadataQuery(distance=True),
        limit=2,
        return_properties=["name", "family", "color"]
    )

    for o in response.objects:
        print(o.properties)
        print(o.metadata.distance)
finally:
    client.close()

This query uses the near_text function to find products with descriptions similar to the given concept. Weaviate will return products that it considers semantically similar based on the vector embeddings of their descriptions.

STEP 6: OUTPUT
The output of this query returns the two closest products, including some of the object properties and the distance:

Successfully connected to Weaviate.
{'family': 'T-SHIRTS', 'color': 'Red', 'name': 'Relaxed Fit Tee'}
0.0
{'family': 'T-SHIRTS', 'color': 'White', 'name': 'THE PERFECT TEE'}
0.0

CONCLUSION
This Refcard provides an overview of vector database fundamentals as well as a practical application in fashion retail. By customizing the dataset and queries, you can explore the full potential of vector databases for similarity searches and other AI-driven applications. This is just a starting point in the world of vectors. ML models and vectors represent powerful tools in the area of machine learning and artificial intelligence, offering a nuanced and high-dimensional representation of complex data. Vector databases are not a magical solution that provides immediate value; like all good wine, they require careful experimentation, parameter optimization, and ongoing evaluation, from engineers and winemakers alike.

WRITTEN BY MIGUEL GARCÍA LORENZO,
VP OF ENGINEERING, NEXTAIL
Miguel is VP of Engineering at Nextail. He has 10+ years in the data space leading teams and building high-performance solutions. A book lover and advocate of platform design as a service and data as a product.
