Requirements: Full-text search within a site is a commonly used feature, and it is usually built and deployed on ElasticSearch or Solr. Two newer search engines, RedisSearch and MeiliSearch, have also been released. The first two are heavyweight; the latter two are lighter, but they still require deploying a dependent service. This article uses Lucene.Net + Jieba.NET to build a lightweight on-site search.
Lucene.Net
Lucene.Net is a .NET port of Lucene, an open-source full-text search toolkit. It is not a complete full-text search engine but a search engine architecture that provides a complete query engine and indexing engine.
Jieba.NET
Jieba.NET is the .NET version (a C# implementation) of jieba, the Chinese word segmenter. It performs word segmentation, part-of-speech tagging, and keyword extraction on Chinese text, and it supports custom dictionaries.
First, let's take a look at the finished result:
Customize the Analyzer
Analysis in Lucene involves four types: Analyzer, TokenStream, Tokenizer, and TokenFilter. A TokenStream is the stream produced once tokenization has finished. It carries the information about each token, and that information is read back through the TokenStream.
Converting a character stream into a TokenStream works as follows. First, a Tokenizer splits the text into words; different tokenizers split text in different ways. Next, TokenFilters filter the tokens that were produced, for example by removing stop words. Whatever remains after filtering makes up the final TokenStream.
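The pipeline described above can be sketched in a few lines. This is only an illustration, assuming Lucene.Net 4.8 with the Lucene.Net.Analysis.Common package; it uses the built-in StandardTokenizer rather than Jieba, and the class name TokenPipelineSketch is made up for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Hypothetical helper, not part of Lucene.Net.
public static class TokenPipelineSketch
{
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;

    // Tokenizer -> TokenFilter(s) -> TokenStream, then read the terms back out.
    public static List<string> Tokens(string text)
    {
        // 1. The Tokenizer splits the raw character stream into words.
        Tokenizer tokenizer = new StandardTokenizer(MatchVersion, new StringReader(text));

        // 2. TokenFilters wrap the stream: lower-case first, then drop English stop words.
        TokenStream stream = new LowerCaseFilter(MatchVersion, tokenizer);
        stream = new StopFilter(MatchVersion, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

        var terms = new List<string>();
        using (stream)
        {
            // 3. Attributes expose the per-token information carried by the stream.
            var termAttr = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();
            while (stream.IncrementToken())
                terms.Add(termAttr.ToString());
            stream.End();
        }
        return terms;
    }
}
```

Calling TokenPipelineSketch.Tokens("The quick Brown fox") should yield quick, brown, fox: the stop word "the" is filtered out and the remaining terms are lower-cased.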
To build a custom Lucene analyzer on top of Jieba.NET, first add the following references:
Reference the Lucene.Net.Analysis.Common package directly. Because Lucene.Net.Analysis.Common depends on the Lucene.Net package, Lucene.Net is downloaded automatically.
Create a new JiebaTokenizer.cs with the following code:
Create a new JiebaAnalyzer.cs with the following code:
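A minimal sketch of what the two classes could look like, assuming Lucene.Net 4.8 and the jieba.NET package (JiebaSegmenter.Tokenize and its Token type). It is not necessarily the exact code used for this site; the choice of filters, in particular the English stop-word set, is an assumption, and a Chinese stop-word list would have to be supplied separately.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;
// Aliases avoid a clash between JiebaNet.Segmenter.Token and Lucene.Net.Analysis.Token.
using JiebaSegmenter = JiebaNet.Segmenter.JiebaSegmenter;
using JiebaToken = JiebaNet.Segmenter.Token;
using TokenizerMode = JiebaNet.Segmenter.TokenizerMode;

// JiebaTokenizer.cs (sketch): feeds jieba.NET segments into a Lucene token stream.
public sealed class JiebaTokenizer : Tokenizer
{
    private readonly JiebaSegmenter _segmenter = new JiebaSegmenter();
    private readonly TokenizerMode _mode;
    private readonly ICharTermAttribute _termAtt;
    private readonly IOffsetAttribute _offsetAtt;
    private List<JiebaToken> _tokens = new List<JiebaToken>();
    private string _text = string.Empty;
    private int _index;

    public JiebaTokenizer(TextReader input, TokenizerMode mode) : base(input)
    {
        _mode = mode;
        _termAtt = AddAttribute<ICharTermAttribute>();
        _offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    public override void Reset()
    {
        base.Reset();
        // Read the whole field value and segment it once per document.
        _text = m_input.ReadToEnd();
        _tokens = _segmenter.Tokenize(_text, _mode).ToList();
        _index = 0;
    }

    public override bool IncrementToken()
    {
        ClearAttributes();
        while (_index < _tokens.Count)
        {
            JiebaToken token = _tokens[_index++];
            if (string.IsNullOrWhiteSpace(token.Word)) continue; // skip blank segments
            _termAtt.SetEmpty().Append(token.Word);
            _offsetAtt.SetOffset(CorrectOffset(token.StartIndex), CorrectOffset(token.EndIndex));
            return true;
        }
        return false;
    }

    public override void End()
    {
        base.End();
        int finalOffset = CorrectOffset(_text.Length);
        _offsetAtt.SetOffset(finalOffset, finalOffset);
    }
}

// JiebaAnalyzer.cs (sketch): tokenizer plus lower-casing and stop-word filtering.
public class JiebaAnalyzer : Analyzer
{
    private readonly TokenizerMode _mode;

    public JiebaAnalyzer(TokenizerMode mode = TokenizerMode.Default) { _mode = mode; }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        var tokenizer = new JiebaTokenizer(reader, _mode);
        TokenStream stream = new LowerCaseFilter(LuceneVersion.LUCENE_48, tokenizer);
        stream = new StopFilter(LuceneVersion.LUCENE_48, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```

The same analyzer instance should be used both when indexing and when analyzing the search text, otherwise the terms in the query will not line up with the terms in the index.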
Creating documents and searching with Lucene.Net
Data from the website is added to Lucene on a schedule or when triggered. Lucene runs each document through the analyzer and stores the resulting index on the physical disk; the search interface is then called to query it.
The code of the LuceneHelper helper class is as follows:
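As a rough sketch of what such a helper has to do, the class below adds or updates a document and runs a simple search. It assumes Lucene.Net 4.8; the class name LuceneHelperSketch, the field names id, title, and body, and the decision to OR the analyzed query terms together are all assumptions made for illustration, not the original LuceneHelper.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Util;
using LuceneDirectory = Lucene.Net.Store.Directory;

// Hypothetical helper class for illustration only.
public sealed class LuceneHelperSketch : IDisposable
{
    private readonly Analyzer _analyzer;
    private readonly LuceneDirectory _directory;
    private readonly IndexWriter _writer;

    public LuceneHelperSketch(LuceneDirectory directory, Analyzer analyzer)
    {
        _analyzer = analyzer;
        _directory = directory;
        _writer = new IndexWriter(directory, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer));
    }

    // Adds the document, replacing any existing document with the same id.
    public void AddOrUpdate(string id, string title, string body)
    {
        var doc = new Document
        {
            new StringField("id", id, Field.Store.YES),      // exact-match key, not analyzed
            new TextField("title", title, Field.Store.YES),  // analyzed and returned in results
            new TextField("body", body, Field.Store.NO)      // analyzed, searchable, not stored
        };
        _writer.UpdateDocument(new Term("id", id), doc);
        _writer.Commit();
    }

    // Analyzes the search text with the same analyzer and ORs the resulting terms.
    public IList<string> SearchTitles(string text, int top = 10)
    {
        var query = new BooleanQuery();
        using (TokenStream ts = _analyzer.GetTokenStream("body", text))
        {
            var term = ts.AddAttribute<ICharTermAttribute>();
            ts.Reset();
            while (ts.IncrementToken())
                query.Add(new TermQuery(new Term("body", term.ToString())), Occur.SHOULD);
            ts.End();
        }

        using var reader = _writer.GetReader(applyAllDeletes: true);
        var searcher = new IndexSearcher(reader);
        var titles = new List<string>();
        foreach (ScoreDoc hit in searcher.Search(query, top).ScoreDocs)
            titles.Add(searcher.Doc(hit.Doc).Get("title"));
        return titles;
    }

    public void Dispose()
    {
        _writer.Dispose();
        _directory.Dispose();
    }
}
```

For an index on disk, the directory argument would typically be FSDirectory.Open(path) from Lucene.Net.Store, and the analyzer would be the Jieba-based analyzer from the previous section.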
Store (stored fields)
Store.YES: The field is indexed and its value is also stored, so search results can return the content of the field.
Store.NO: The field is indexed but its value is not stored, so search results cannot return the content of the field. This saves disk space.
As shown below:
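The difference can also be checked with a small in-memory sketch, assuming Lucene.Net 4.8 with StandardAnalyzer from Lucene.Net.Analysis.Common. The class name StoreSketch and the field names are made up for this example.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical demo class for illustration only.
public static class StoreSketch
{
    private const LuceneVersion V = LuceneVersion.LUCENE_48;

    public static (string Title, string Body) RoundTrip()
    {
        using var dir = new RAMDirectory();
        using var analyzer = new StandardAnalyzer(V);
        using (var writer = new IndexWriter(dir, new IndexWriterConfig(V, analyzer)))
        {
            writer.AddDocument(new Document
            {
                new TextField("title", "Lucene in a nutshell", Field.Store.YES),
                new TextField("body", "indexed but not stored", Field.Store.NO)
            });
        } // disposing the writer commits the document

        using var reader = DirectoryReader.Open(dir);
        var searcher = new IndexSearcher(reader);

        // The body field is still searchable even though its value is not stored.
        TopDocs hits = searcher.Search(new TermQuery(new Term("body", "indexed")), 1);
        Document found = searcher.Doc(hits.ScoreDocs[0].Doc);

        // Title comes back with its value; Body comes back as null.
        return (found.Get("title"), found.Get("body"));
    }
}
```

The document is found through the body field, yet only the title can be read back from the hit.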
Field types
The field types in Lucene.Net include Int32Field, Int64Field, SingleField, DoubleField, BinaryDocValuesField, NumericDocValuesField, SortedDocValuesField, StringField, TextField, and StoredField. Choose the type that fits your data.
TextField vs. StringField
A TextField is always analyzed (tokenized) before it is indexed. A StringField is not analyzed; its whole value is indexed as a single term.
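A short sketch makes the distinction concrete. It assumes Lucene.Net 4.8 and StandardAnalyzer; the class name FieldTypeSketch and the field names tag and title are hypothetical.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical demo class for illustration only.
public static class FieldTypeSketch
{
    private const LuceneVersion V = LuceneVersion.LUCENE_48;

    // Returns hit counts for: exact StringField value, partial StringField value, TextField term.
    public static (int ExactTag, int PartialTag, int TitleTerm) Counts()
    {
        using var dir = new RAMDirectory();
        using var analyzer = new StandardAnalyzer(V);
        using (var writer = new IndexWriter(dir, new IndexWriterConfig(V, analyzer)))
        {
            writer.AddDocument(new Document
            {
                new StringField("tag", "Full Text", Field.Store.YES),  // indexed as the single term "Full Text"
                new TextField("title", "Full Text", Field.Store.YES)   // analyzed into "full" and "text"
            });
        }

        using var reader = DirectoryReader.Open(dir);
        var searcher = new IndexSearcher(reader);
        int Count(string field, string value) =>
            searcher.Search(new TermQuery(new Term(field, value)), 1).TotalHits;

        return (Count("tag", "Full Text"), Count("tag", "full"), Count("title", "full"));
    }
}
```

The StringField only matches its exact, unmodified value, while the TextField matches each lower-cased word produced by the analyzer. That is why identifiers and keys usually go into a StringField and searchable text into a TextField.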
Combining queries with Occur
There are six combinations:
1. MUST and MUST: The result is the intersection of the query clauses.
2. MUST and MUST_NOT: The result contains the matches of the MUST clause, excluding anything matched by the MUST_NOT clause.
3. SHOULD and MUST_NOT: When used together, the effect is the same as MUST and MUST_NOT.
4. SHOULD and MUST: The result is the set of matches of the MUST clause; the SHOULD clause only affects the ranking.
5. SHOULD and SHOULD: An "or" relationship; the result is the union of all query clauses.
6. MUST_NOT and MUST_NOT: Meaningless; the search returns no results.
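These combinations can be tried out against a tiny in-memory index. The sketch below assumes Lucene.Net 4.8 and WhitespaceAnalyzer from Lucene.Net.Analysis.Common; the class name OccurSketch, the field name body, and the three sample documents are made up for this example.

```csharp
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical demo class for illustration only.
public static class OccurSketch
{
    private const LuceneVersion V = LuceneVersion.LUCENE_48;

    // Builds a BooleanQuery from (word, occur) pairs and returns the number of matching documents.
    public static int Count(params (string Word, Occur Mode)[] clauses)
    {
        using var dir = new RAMDirectory();
        using var analyzer = new WhitespaceAnalyzer(V);
        using (var writer = new IndexWriter(dir, new IndexWriterConfig(V, analyzer)))
        {
            foreach (string text in new[] { "apple banana", "apple cherry", "banana cherry" })
                writer.AddDocument(new Document { new TextField("body", text, Field.Store.NO) });
        }

        var query = new BooleanQuery();
        foreach (var (word, mode) in clauses)
            query.Add(new TermQuery(new Term("body", word)), mode);

        using var reader = DirectoryReader.Open(dir);
        return new IndexSearcher(reader).Search(query, 10).TotalHits;
    }
}
```

With these three documents, Count(("apple", Occur.MUST), ("banana", Occur.MUST)) matches only "apple banana", Count(("apple", Occur.SHOULD), ("banana", Occur.SHOULD)) matches all three, and a query made only of MUST_NOT clauses matches nothing.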
Run the project
At this point you can start the project. The interface code for adding, updating, and searching is as follows:
You may see errors like the following:
An unhandled exception occurred while processing the request.
DirectoryNotFoundException: Could not find a part of the path 'C:\Users\itsvse_nuc11\source\repos\DiscuzSearch\DiscuzSearch\bin\Debug\net6.0\Resources\prob_trans.json'.
Microsoft.Win32.SafeHandles.SafeFileHandle.CreateFile(string fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options)
TypeInitializationException: The type initializer for 'JiebaNet.Segmenter.JiebaSegmenter' threw an exception.
After jieba.NET is installed, the packages\jieba.NET directory contains a Resources directory, which holds the dictionaries and other data files that jieba.NET needs at run time. The easiest way to configure it is to copy the entire Resources directory into the directory where the assembly is located, so that jieba.NET uses its built-in default configuration values.
The Resources directory is located at:
%USERPROFILE%\.nuget\packages\jieba.net\0.42.2\Resources
JiebaNet needs to be told where its configuration folder is; the code is as follows:
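One way to do this, sketched under the assumption that the Resources folder has been copied next to the built assembly, is to set ConfigManager.ConfigFileBaseDir (a setting documented in the jieba.NET README) before the first JiebaSegmenter is created. The class name JiebaConfigSketch is hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical startup helper for illustration only.
public static class JiebaConfigSketch
{
    // Assumption: Resources was copied into the application's output directory.
    public static string ResourcesDir() => Path.Combine(AppContext.BaseDirectory, "Resources");

    // Call once at startup, before any JiebaSegmenter is constructed,
    // because the segmenter loads its dictionaries when it is first used.
    public static void Configure()
    {
        JiebaNet.Segmenter.ConfigManager.ConfigFileBaseDir = ResourcesDir();
    }
}
```

In an ASP.NET Core project this call would typically go at the top of Program.cs, before the search services are registered.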
In my test, creating 500 new documents produced an index that occupies 119 KB on disk (this reflects my own data and is for reference only), as shown in the figure below:
(End)