
[AI] (13) A brief introduction to vector similarity and distance

Posted on 2025-3-21 13:37:09
Requirements: Last time I wrote an article about selecting an embedding model and obtaining vectors. After calling the embedding model to get the vector values and storing them in a vector database, which algorithm should be used to calculate vector similarity?

Vector

In linear algebra, vectors are defined abstractly as elements of a vector space (also known as a linear space); they are the basic building blocks of that space.


[Figure: many arrows, each representing a vector]

Vector similarity

Some methods for vector similarity calculation:

  • Euclidean Distance
  • Cosine Similarity
  • Pearson Correlation Coefficient
  • Adjusted Cosine
  • Hamming Distance
  • Manhattan Distance
  • Chebyshev Distance


Cosine Similarity

Cosine similarity measures the similarity between two vectors as the cosine of the angle between them. The cosine of a 0° angle is 1, while the cosine of any other angle is less than 1, with a minimum value of -1. The cosine of the angle between two vectors therefore indicates whether they point in roughly the same direction: when the two vectors point the same way, the cosine similarity is 1; when the angle between them is 90°, it is 0; and when they point in opposite directions, it is -1. The result depends only on the direction of the vectors, not on their length. In general the value lies between -1 and 1; cosine similarity is often applied in positive spaces (e.g., vectors with non-negative components), where the value lies between 0 and 1.

In other words, cosine similarity uses the cosine of the angle between two vectors in vector space as a measure of the difference between two individuals: the closer the cosine is to 1, the closer the angle is to 0° and the more similar the two vectors are. Hence the name "cosine similarity".

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
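A minimal sketch of this definition (assuming NumPy is installed; the helper name cosine_similarity is my own):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0: same direction, different length
```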
Pearson Correlation Coefficient

Given two random variables X and Y, the Pearson correlation coefficient measures how linearly correlated the two are, using the following formula:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
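A quick sketch (assuming NumPy; np.corrcoef returns the full correlation matrix, so we take the off-diagonal entry):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is rho(X, Y)
rho = np.corrcoef(x, y)[0, 1]
print(rho)
```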

Jaccard Coefficient

Suppose there are two sets X and Y (note that these are sets, not vectors). The Jaccard coefficient is the size of their intersection divided by the size of their union:

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
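A minimal sketch using Python's built-in sets (the helper name jaccard is illustrative):

```python
def jaccard(x: set, y: set) -> float:
    """|X ∩ Y| / |X ∪ Y|; defined here as 0 for two empty sets."""
    union = x | y
    if not union:
        return 0.0
    return len(x & y) / len(union)

print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 2 / 4 = 0.5
```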

Dot Product

The dot product, also known as the scalar product (and, in Euclidean space, the inner product), multiplies corresponding elements and sums the results; the result is a scalar (i.e., a single number). It is a binary operation that takes two vectors over the real numbers R and returns a real scalar, and it is the standard inner product of Euclidean space: $a \cdot b = \sum_{i=1}^{n} a_i b_i$.
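For example, with NumPy (a sketch of the element-wise multiply-and-sum above):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Multiply corresponding elements and sum: 1*4 + 2*5 + 3*6 = 32
print(np.dot(a, b))  # 32.0
print(a @ b)         # same result via the @ operator
```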

Common distances

Minkowski Distance

Minkowski Distance is a generalized form of several distance metrics: when p = 1, it is the Manhattan distance; when p = 2, it is the Euclidean distance; and in the limit p → ∞, it becomes the Chebyshev distance (a code sketch covering all three appears after the Chebyshev Distance section below).

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Manhattan Distance

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Euclidean Distance

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Chebyshev Distance

$$d(x, y) = \max_{i} |x_i - y_i|$$

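A minimal sketch (assuming NumPy; the helper name minkowski is my own) showing that p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and the p → ∞ limit the Chebyshev distance:

```python
import numpy as np

def minkowski(x: np.ndarray, y: np.ndarray, p: float) -> float:
    """(sum |x_i - y_i|^p)^(1/p); pass p=np.inf for the Chebyshev limit."""
    if np.isinf(p):
        return float(np.max(np.abs(x - y)))
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

print(minkowski(x, y, 1))       # 7.0 -> Manhattan
print(minkowski(x, y, 2))       # 5.0 -> Euclidean
print(minkowski(x, y, np.inf))  # 4.0 -> Chebyshev
```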
Hamming Distance

In information theory, the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ. Suppose there are two strings x=[x1,x2,...,xn] and y=[y1,y2,...,yn]; then the distance between them is:

$$d(x, y) = \sum_{i=1}^{n} \mathbb{1}(x_i \neq y_i)$$

where $\mathbb{1}$ denotes the indicator function, which is 1 when the two characters differ and 0 when they are the same.
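A small sketch for equal-length strings (the helper name hamming is illustrative):

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which the characters differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming("karolin", "kathrin"))  # 3
```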

KL Divergence

Given a random variable X and two probability distributions P and Q over it, the KL divergence measures the difference between the two distributions using the following formula:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$


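A minimal sketch (assuming NumPy; kl_divergence is an illustrative helper) for two discrete distributions over the same support. Note that KL divergence is asymmetric and requires Q(x) > 0 wherever P(x) > 0:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Sum of p * log(p / q) over entries where p > 0."""
    p = p / p.sum()  # make sure both are valid distributions
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.5, 0.2])
print(kl_divergence(p, q))  # > 0; kl_divergence(q, p) differs (asymmetric)
```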
Summary

Dot product distance and cosine similarity are commonly used to measure similarity in vector or text data, for example document similarity in text mining and natural language processing, as well as in information retrieval, recommendation systems, and other fields. If you're using a modern embedding model such as Sentence-BERT or another pre-trained model, the output is usually normalized, in which case the dot product is the preferred option (for normalized vectors it equals cosine similarity).
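A quick check (assuming NumPy) of why the dot product suffices for normalized embeddings: after L2 normalization, the dot product equals cosine similarity.

```python
import numpy as np

e1 = np.array([0.3, 1.2, -0.5])
e2 = np.array([0.8, 0.1, 0.4])

# L2-normalize, as many embedding models already do for their outputs
e1 = e1 / np.linalg.norm(e1)
e2 = e2 / np.linalg.norm(e2)

cos = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
print(np.isclose(np.dot(e1, e2), cos))  # True: dot product == cosine similarity
```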




