Recently I had to solve a problem: comparing a string entered by the user with a string obtained by the system. If the difference between the two is small enough, the system should treat the input as a match, ideally against a configurable threshold. CAPTCHA recognition needs the same kind of comparison between the recognized and the expected character codes. Both cases can be handled with the edit distance algorithm, so this post records its principle and a C# implementation.
According to Baidu Encyclopedia:
Edit distance, also known as Levenshtein distance, is the minimum number of single-character edits required to transform one string into another; the greater the distance, the more the two strings differ. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
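For example, converting the word kitten to sitting takes three edits, so the edit distance is 3:
kitten → sitten (replace k with s)
sitten → sittin (replace e with i)
sittin → sitting (insert g at the end)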
The Russian scientist Vladimir Levenshtein proposed this concept in 1965, hence the name Levenshtein distance.
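For example:
If str1="ivan" and str2="ivan", the distance is 0, because no edit is needed. similarity = 1 - 0/Math.Max(str1.Length, str2.Length) = 1
If str1="ivan1" and str2="ivan2", the distance is 1, because only the "1" in str1 has to be replaced with "2". similarity = 1 - 1/Math.Max(str1.Length, str2.Length) = 1 - 1/5 = 0.8
Note that in C# the division has to be done in floating point (for example by casting the distance to double); with two int operands, 1/5 evaluates to 0 and the similarity would always come out as 1.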
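The similarity used in the two examples above can be written as a formula. Here d(s1, s2) stands for the edit distance and |s| for the length of a string; this notation is introduced only for this note, and the formula assumes that at least one of the two strings is non-empty.

```latex
\[
\mathrm{similarity}(s_1, s_2) = 1 - \frac{d(s_1, s_2)}{\max(|s_1|, |s_2|)}
\]
\[
\mathrm{similarity}(\texttt{ivan1}, \texttt{ivan2}) = 1 - \frac{1}{\max(5, 5)} = 0.8
\]
```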
Applications:
- DNA analysis
- Spell check
- Speech recognition
- Plagiarism detection
The algorithm is implemented in C#:
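What follows is a minimal sketch of the standard dynamic-programming formulation, not necessarily identical to the listing this post was originally written around. The class and method names (LevenshteinSketch, Distance, Similarity) are illustrative, and treating null as an empty string and two empty strings as fully similar are assumptions of this sketch.

```csharp
using System;

public static class LevenshteinSketch
{
    // Returns the minimum number of single-character replacements,
    // insertions, and deletions needed to turn source into target.
    public static int Distance(string source, string target)
    {
        // Assumption of this sketch: null is treated like an empty string.
        source = source ?? string.Empty;
        target = target ?? string.Empty;

        int n = source.Length;
        int m = target.Length;

        // d[i, j] is the distance between the first i characters of source
        // and the first j characters of target.
        int[,] d = new int[n + 1, m + 1];

        // Reducing a prefix to the empty string costs one deletion per
        // character; building a prefix from it costs one insertion each.
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                // Equal characters cost nothing; otherwise one replacement.
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;

                int deletion = d[i - 1, j] + 1;
                int insertion = d[i, j - 1] + 1;
                int substitution = d[i - 1, j - 1] + cost;

                d[i, j] = Math.Min(Math.Min(deletion, insertion), substitution);
            }
        }

        return d[n, m];
    }

    // similarity = 1 - distance / max(length1, length2), in floating point.
    public static double Similarity(string source, string target)
    {
        source = source ?? string.Empty;
        target = target ?? string.Empty;

        int maxLength = Math.Max(source.Length, target.Length);

        // Assumption of this sketch: two empty strings count as identical.
        if (maxLength == 0) return 1.0;

        return 1.0 - (double)Distance(source, target) / maxLength;
    }
}
```

With this sketch, LevenshteinSketch.Distance("kitten", "sitting") evaluates to 3 and LevenshteinSketch.Similarity("ivan1", "ivan2") to 0.8, matching the examples earlier in the post. A caller could accept an input when the similarity reaches a chosen threshold; the right threshold depends on the application and has to be tuned.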
Test code:
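As an illustration of this kind of test, here is a small self-contained console sketch. It is not the original test listing: the input strings are made up for the example, the class name SimilarityTestSketch is hypothetical, and it uses a two-row variant of the dynamic-programming table so that it can run on its own.

```csharp
using System;

public static class SimilarityTestSketch
{
    // Two-row variant of the Levenshtein table: only the previous and the
    // current row are kept, which is enough to compute the final distance.
    static int Distance(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Row 0: building the first j characters of b from an empty string.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // The row just filled becomes the previous row of the next pass.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    public static double Similarity(string a, string b)
    {
        int maxLength = Math.Max(a.Length, b.Length);
        return maxLength == 0 ? 1.0 : 1.0 - (double)Distance(a, b) / maxLength;
    }

    public static void Main()
    {
        string expected = "ivan";

        // Exact match, extra space, extra punctuation, shifted characters.
        string[] inputs = { "ivan", "i van", "ivan!", "vani" };

        foreach (string input in inputs)
        {
            Console.WriteLine("\"{0}\" vs \"{1}\": {2:F2}",
                expected, input, Similarity(expected, input));
        }
    }
}
```

In this sketch the extra space and the extra punctuation mark each lower the similarity to 0.8, and moving the first character to the end lowers it to 0.5.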
The test results show that spaces, punctuation, and differences in character position all affect the similarity result. Therefore, when comparing strings for recognition, it is recommended to remove all spaces and special symbols from the strings before calling the algorithm.
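The clean-up step recommended above could look roughly like this. The class name ComparisonNormalizer and the rule "keep only letters and digits" are assumptions of this sketch; which characters count as special symbols depends on the data being compared.

```csharp
using System;
using System.Text;

public static class ComparisonNormalizer
{
    // Keeps letters and digits only, dropping whitespace, punctuation,
    // and other symbols. Assumption of this sketch: null becomes "".
    public static string Normalize(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        var builder = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // char.IsLetterOrDigit is Unicode-aware, so non-Latin letters
            // are kept while full-width punctuation is removed.
            if (char.IsLetterOrDigit(c))
            {
                builder.Append(c);
            }
        }

        return builder.ToString();
    }
}
```

Both strings would be passed through Normalize before the distance is computed, so that "i van!" and "ivan" compare as identical. Because the check works on single char values, characters stored as surrogate pairs are dropped as well, which may or may not be acceptable for a given input.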
There is also a C# library on GitHub for string similarity comparison, FuzzyString. Its author describes it as follows:
FuzzyString is a library developed for my daily work, where it is used to reconcile naming conventions between different models of the power grid. I've stripped out the power system-specific code and put together what can effectively be used as a string extension to determine approximate equality between two strings. All the algorithms used here have been taken from online sources, converted to C#, and compiled into this library. I found several other similar open source implementations, but none that were available for .NET / C#. Adding the *.dll to your project gives you access to the ApproximatelyEquals() extension and to the individual extensions underlying it.
nuget install:
Algorithms included in this project:
- Hamming distance
- Jaccard distance
- Jaro distance
- Jaro-Winkler distance
- Levenshtein distance
- Longest common subsequence
- Longest common substring
- Overlap coefficient
- Ratcliff-Obershelp similarity
- Sorensen-Dice distance
- Tanimoto coefficient
Use:
Outcome:
(End)