[Hands-on] Building a lightweight site search with Lucene.Net + Jieba.NET

Little scum · Posted on 10/29/2023 6:05:43 PM

Requirements: Full-text search is a common site feature, and it is usually built on Elasticsearch or Solr. Two newer engines, RediSearch and MeiliSearch, are lighter than those two, but they still require deploying a separate service. This article uses Lucene.Net + Jieba.NET to build a lightweight site search that needs no separate service.

Lucene.Net

Lucene.Net is a .NET port of Lucene, an open-source full-text search toolkit. It is not a complete, ready-to-run search engine; it is a library that provides the indexing engine and query engine from which a search feature is built.

Jieba.NET

Jieba.NET is the .NET version (a C# implementation) of the Chinese word segmenter jieba. It performs word segmentation, part-of-speech tagging, and keyword extraction on Chinese text, and it supports custom dictionaries.

Customize the Analyzer

Text analysis in Lucene involves four types: Analyzer, TokenStream, Tokenizer, and TokenFilter.

TokenStream: the stream produced once the text has been tokenized. It carries the information about each token, and that information is read through the TokenStream.

Text is converted into a TokenStream as follows. First, a Tokenizer splits the text into tokens; different tokenizers split text differently. Next, TokenFilters filter the resulting tokens, for example by removing stop words. The tokens that remain make up the final TokenStream.
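
The flow just described can be observed by reading a TokenStream directly. The following is a minimal sketch, assuming Lucene.Net 4.8 with the Lucene.Net.Analysis.Common package; TokenStreamDemo and Tokens are illustrative names and do not come from the original post.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

// Illustrative helper (hypothetical name): runs text through an Analyzer
// and collects the terms that come out of its TokenStream.
public static class TokenStreamDemo
{
    public static List<string> Tokens(Analyzer analyzer, string text)
    {
        var result = new List<string>();
        // The field name only matters for analyzers that vary per field.
        using (TokenStream stream = analyzer.GetTokenStream("content", text))
        {
            // The term attribute exposes the text of the current token.
            var term = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();                    // required before the first IncrementToken
            while (stream.IncrementToken())    // advances to the next token, false at the end
            {
                result.Add(term.ToString());
            }
            stream.End();
        }
        return result;
    }
}
```

For example, passing `new StandardAnalyzer(LuceneVersion.LUCENE_48)` and the text "The Quick Fox" yields the terms quick and fox: the tokenizer splits on whitespace, and the filters lowercase the tokens and drop the stop word "the".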

To build a custom Lucene analyzer on top of Jieba.NET, first add the package references for jieba.NET and Lucene.Net.Analysis.Common.

Referencing Lucene.Net.Analysis.Common directly is enough: it depends on the Lucene.Net package, so that package is downloaded automatically.

Create a new JiebaTokenizer.cs, which wraps the Jieba.NET segmenter as a Lucene Tokenizer.

Create a new JiebaAnalyzer.cs, an Analyzer that uses the JiebaTokenizer.
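
The original listings for these two files are not reproduced in this copy of the post. The following is a minimal sketch of what such a tokenizer and analyzer could look like, assuming Lucene.Net 4.8 and jieba.NET 0.42.x. It is not the author's code: the choice of TokenizerMode.Default, the whitespace filtering, and the private field names are all assumptions.

```csharp
using System.Collections.Generic;
using System.IO;
using JiebaNet.Segmenter;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

// Sketch only: feeds the words found by Jieba.NET into Lucene's token attributes.
public sealed class JiebaTokenizer : Tokenizer
{
    private readonly JiebaSegmenter _segmenter = new JiebaSegmenter();
    private readonly ICharTermAttribute _termAtt;
    private readonly IOffsetAttribute _offsetAtt;
    // Fully qualified because Lucene.Net.Analysis also defines a Token type.
    private IEnumerator<JiebaNet.Segmenter.Token> _tokens;
    private int _finalOffset;

    public JiebaTokenizer(TextReader input) : base(input)
    {
        _termAtt = AddAttribute<ICharTermAttribute>();
        _offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    public override void Reset()
    {
        base.Reset();
        // Jieba works on whole strings, so the reader is consumed up front.
        string text = m_input.ReadToEnd();
        _finalOffset = text.Length;
        // Assumption: Default mode, which produces non-overlapping words.
        _tokens = _segmenter.Tokenize(text, TokenizerMode.Default).GetEnumerator();
    }

    public override bool IncrementToken()
    {
        ClearAttributes();
        while (_tokens != null && _tokens.MoveNext())
        {
            JiebaNet.Segmenter.Token token = _tokens.Current;
            if (string.IsNullOrWhiteSpace(token.Word))
            {
                continue; // skip tokens that are only whitespace
            }
            _termAtt.SetEmpty().Append(token.Word);
            _offsetAtt.SetOffset(CorrectOffset(token.StartIndex), CorrectOffset(token.EndIndex));
            return true;
        }
        return false;
    }

    public override void End()
    {
        base.End();
        // Report the end of the input as the final offset.
        int end = CorrectOffset(_finalOffset);
        _offsetAtt.SetOffset(end, end);
    }
}

// Sketch only: an Analyzer whose token stream is just the JiebaTokenizer, with no filters.
public sealed class JiebaAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        return new TokenStreamComponents(new JiebaTokenizer(reader));
    }
}
```

TokenizerMode.Search is the other mode jieba.NET offers; it additionally emits shorter words found inside long words, which can improve recall at the cost of overlapping tokens. A stop-word TokenFilter could be chained after the tokenizer in CreateComponents if needed.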

Creating documents and searching with Lucene.Net

Site data is added to Lucene either on a schedule or when a change triggers it. Lucene runs each document through the analyzer and writes the index to disk; the search interface then queries that index.

A LuceneHelper helper class wraps writing documents and searching.
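
The original LuceneHelper listing is not reproduced in this copy of the post. The following is a minimal sketch of such a helper, assuming Lucene.Net 4.8. The field names (id, title, content), the method names, and the way the query is built are assumptions made for illustration, not the author's implementation.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Util;

// Sketch only: one writer per index, add-or-update by id, keyword search.
public sealed class LuceneHelper : IDisposable
{
    private const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
    private readonly Analyzer _analyzer;
    private readonly IndexWriter _writer;

    public LuceneHelper(Lucene.Net.Store.Directory directory, Analyzer analyzer)
    {
        _analyzer = analyzer;
        _writer = new IndexWriter(directory, new IndexWriterConfig(AppLuceneVersion, analyzer));
    }

    // Replaces any existing document with the same id, or adds a new one.
    public void AddOrUpdate(string id, string title, string content)
    {
        var doc = new Document
        {
            new StringField("id", id, Field.Store.YES),        // exact-match key
            new TextField("title", title, Field.Store.YES),    // analyzed and returned
            new TextField("content", content, Field.Store.NO)  // analyzed, not returned
        };
        _writer.UpdateDocument(new Term("id", id), doc);
        _writer.Commit(); // simple but costly; batch commits when indexing in bulk
    }

    // Returns the ids of the best matches for the keyword, best first.
    public List<string> Search(string keyword, int top = 10)
    {
        // Analyze the keyword with the same analyzer used for indexing and
        // match any resulting term in either the title or the content.
        var query = new BooleanQuery();
        using (TokenStream stream = _analyzer.GetTokenStream("content", keyword))
        {
            var term = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();
            while (stream.IncrementToken())
            {
                string word = term.ToString();
                query.Add(new TermQuery(new Term("title", word)), Occur.SHOULD);
                query.Add(new TermQuery(new Term("content", word)), Occur.SHOULD);
            }
            stream.End();
        }

        var ids = new List<string>();
        using (DirectoryReader reader = _writer.GetReader(applyAllDeletes: true))
        {
            var searcher = new IndexSearcher(reader);
            foreach (ScoreDoc hit in searcher.Search(query, top).ScoreDocs)
            {
                ids.Add(searcher.Doc(hit.Doc).Get("id"));
            }
        }
        return ids;
    }

    public void Dispose()
    {
        _writer.Dispose();
    }
}
```

For an index on disk, the helper could be constructed with `FSDirectory.Open(path)` and the custom Jieba analyzer; a `RAMDirectory` is convenient for tests.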

Field.Store

Store.YES: The field is indexed and its original value is also stored, so search results can return the field.
Store.NO: The field is indexed but its original value is not stored. The field can be searched, but search results cannot return it. This saves disk space.
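
The difference shows up when a hit is read back. The following is a minimal sketch, assuming Lucene.Net 4.8 with an in-memory index; StoreDemo, Run, and the field names are illustrative.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Illustrative demo (hypothetical names): one stored field, one unstored field.
public static class StoreDemo
{
    // Searches the unstored body field, then reads both fields back from the hit.
    public static (string Title, string Body) Run()
    {
        using (var dir = new RAMDirectory())
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            using (var writer = new IndexWriter(dir, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
            {
                writer.AddDocument(new Document
                {
                    new TextField("title", "Site search", Field.Store.YES),
                    new TextField("body", "Lucene indexes this text", Field.Store.NO)
                });
                writer.Commit();
            }

            using (var reader = DirectoryReader.Open(dir))
            {
                var searcher = new IndexSearcher(reader);
                // The body is searchable even though its value is not stored.
                ScoreDoc[] hits = searcher.Search(new TermQuery(new Term("body", "lucene")), 1).ScoreDocs;
                Document doc = searcher.Doc(hits[0].Doc);
                // Get returns null for a field that was not stored.
                return (doc.Get("title"), doc.Get("body"));
            }
        }
    }
}
```

Run returns the title "Site search" and a null body: the document was found through the body field, but only the stored title can be returned.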

Field types

The field types in Lucene.Net include Int32Field, Int64Field, SingleField, DoubleField, BinaryDocValuesField, NumericDocValuesField, SortedDocValuesField, StringField, TextField, and StoredField. Choose the type that fits your data.

TextField vs. StringField

A TextField is always analyzed, that is, split into tokens by the analyzer. A StringField is not analyzed: its whole value is indexed as a single term.
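
A short sketch makes the difference concrete. It assumes Lucene.Net 4.8 with StandardAnalyzer; FieldTypeDemo, Count, and the field names are illustrative. The same value is indexed once as a StringField and once as a TextField, and the method counts the hits for an exact term.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Illustrative demo (hypothetical names): the same text in both field types.
public static class FieldTypeDemo
{
    // Indexes "Hello World" into tag (StringField) and body (TextField),
    // then counts the documents containing the given term in the given field.
    public static int Count(string field, string term)
    {
        using (var dir = new RAMDirectory())
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            using (var writer = new IndexWriter(dir, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
            {
                writer.AddDocument(new Document
                {
                    new StringField("tag", "Hello World", Field.Store.NO), // one term: "Hello World"
                    new TextField("body", "Hello World", Field.Store.NO)   // two terms: "hello", "world"
                });
                writer.Commit();
            }

            using (var reader = DirectoryReader.Open(dir))
            {
                var searcher = new IndexSearcher(reader);
                return searcher.Search(new TermQuery(new Term(field, term)), 1).TotalHits;
            }
        }
    }
}
```

`Count("tag", "Hello World")` is 1 and `Count("tag", "hello")` is 0, because the StringField holds the value as one unmodified term. `Count("body", "hello")` is 1 and `Count("body", "Hello World")` is 0, because the analyzer split and lowercased the TextField. StringField therefore suits ids and other exact-match keys, and TextField suits searchable text.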

Combining queries with Occur

The Occur values can be combined in six ways:

1. MUST and MUST: the result is the intersection of the query clauses.
2. MUST and MUST_NOT: the result is what the MUST clause matches, excluding anything the MUST_NOT clause matches.
3. SHOULD and MUST_NOT: behaves the same as MUST and MUST_NOT.
4. SHOULD and MUST: the result is what the MUST clause matches; the SHOULD clause only affects the ranking.
5. SHOULD and SHOULD: an "or" relationship; the result is the union of all the query clauses.
6. MUST_NOT and MUST_NOT: meaningless; the search returns no results.
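
These combinations can be checked against a tiny index. The following is a minimal sketch, assuming Lucene.Net 4.8; OccurDemo, Run, and the three toy documents are illustrative and not part of the original post.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Illustrative demo (hypothetical names): combines term clauses with Occur values.
public static class OccurDemo
{
    // Toy documents: A = {lucene, search}, B = {lucene, jieba}, C = {jieba}.
    private static readonly (string Id, string[] Tags)[] Docs =
    {
        ("A", new[] { "lucene", "search" }),
        ("B", new[] { "lucene", "jieba" }),
        ("C", new[] { "jieba" })
    };

    // Builds a BooleanQuery from the clauses and returns the matching ids,
    // sorted and joined with commas ("" when nothing matches).
    public static string Run(params (string Tag, Occur Occur)[] clauses)
    {
        using (var dir = new RAMDirectory())
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            using (var writer = new IndexWriter(dir, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
            {
                foreach (var (id, tags) in Docs)
                {
                    var doc = new Document { new StringField("id", id, Field.Store.YES) };
                    foreach (string tag in tags)
                    {
                        doc.Add(new StringField("tag", tag, Field.Store.NO));
                    }
                    writer.AddDocument(doc);
                }
                writer.Commit();
            }

            var query = new BooleanQuery();
            foreach (var (tag, occur) in clauses)
            {
                query.Add(new TermQuery(new Term("tag", tag)), occur);
            }

            using (var reader = DirectoryReader.Open(dir))
            {
                var searcher = new IndexSearcher(reader);
                var ids = new SortedSet<string>();
                foreach (ScoreDoc hit in searcher.Search(query, 10).ScoreDocs)
                {
                    ids.Add(searcher.Doc(hit.Doc).Get("id"));
                }
                return string.Join(",", ids);
            }
        }
    }
}
```

With these documents, `Run(("lucene", Occur.MUST), ("jieba", Occur.MUST))` returns "B", `Run(("lucene", Occur.MUST), ("jieba", Occur.MUST_NOT))` returns "A", `Run(("search", Occur.SHOULD), ("jieba", Occur.SHOULD))` returns "A,B,C", and `Run(("lucene", Occur.MUST_NOT), ("jieba", Occur.MUST_NOT))` returns an empty string.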

Run the project

At this point you can start the project. It exposes interfaces for adding, updating, and searching documents.

The following error may be reported:

An unhandled exception occurred while processing the request.
DirectoryNotFoundException: Could not find a part of the path 'C:\Users\itsvse_nuc11\source\repos\DiscuzSearch\DiscuzSearch\bin\Debug\net6.0\Resources\prob_trans.json'.
Microsoft.Win32.SafeHandles.SafeFileHandle.CreateFile(string fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options)

TypeInitializationException: The type initializer for 'JiebaNet.Segmenter.JiebaSegmenter' threw an exception.

After jieba.NET is installed, the packages\jieba.NET directory contains a Resources directory with the dictionaries and other data files that jieba.NET needs at run time. The simplest setup is to copy the entire Resources directory into the directory where the assembly is located, so that jieba.NET runs with its built-in default configuration values.

%USERPROFILE%\.nuget\packages\jieba.net\0.42.2\Resources

JiebaNet also needs to be told where its configuration folder is, which is done in code.
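
The original listing is not reproduced in this copy of the post. The following is a minimal sketch of one way to do it, assuming jieba.NET 0.42.x, where the configuration folder is set through the static property ConfigManager.ConfigFileBaseDir, and assuming the Resources directory was copied next to the assembly. JiebaConfig, ResourcesDir, and Configure are illustrative names.

```csharp
using System;
using System.IO;

// Illustrative helper (hypothetical names): points jieba.NET at its data files.
public static class JiebaConfig
{
    // Builds the path of the Resources directory under the given base directory.
    public static string ResourcesDir(string baseDir)
    {
        return Path.Combine(baseDir, "Resources");
    }

    // Call once at startup, before the first JiebaSegmenter is created.
    public static void Configure()
    {
        // AppContext.BaseDirectory is the directory the application runs from,
        // for example bin\Debug\net6.0 during development.
        JiebaNet.Segmenter.ConfigManager.ConfigFileBaseDir = ResourcesDir(AppContext.BaseDirectory);
    }
}
```

Calling `JiebaConfig.Configure()` early in program startup matters because the segmenter loads its dictionaries in a type initializer; if the files cannot be found at that point, the TypeInitializationException shown above is thrown.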

In my test, 500 new documents took up 119 KB on disk. This figure depends on my own data and is for reference only.

(End)

Little scum · Posted on 10/29/2023 6:10:40 PM

Elasticsearch (ES) fails to write data to the fault solution
https://www.itsvse.com/thread-10568-1-1.html

Elasticsearch automatically cleans up indexes to free up disk space
https://www.itsvse.com/thread-10273-1-1.html

Elasticsearch-7.x uses xpack for security authentication
https://www.itsvse.com/thread-10206-1-1.html

Deploy the Elasticsearch service using Docker
https://www.itsvse.com/thread-10148-1-1.html

Elasticsearch uses elasticdump to back up and migrate data
https://www.itsvse.com/thread-10143-1-1.html

Install the standalone version of elasticsearch 7.10.2 tutorial on Windows
https://www.itsvse.com/thread-9962-1-1.html

Introduction to Elasticsearch search highlight configuration
https://www.itsvse.com/thread-9562-1-1.html

.NET/C# Use Elasticsearch debugging to view request and response information
https://www.itsvse.com/thread-9561-1-1.html

ASP.NET Core Link Trace (5) Jaeger data persists to elasticsearch
https://www.itsvse.com/thread-9553-1-1.html

Elasticsearch (ES) replicates the clone index
https://www.itsvse.com/thread-9545-1-1.html

Elasticsearch(ES) cluster health: yellow (6 of 7) status
https://www.itsvse.com/thread-9544-1-1.html

Elasticsearch(ES) cluster health: red Failure analysis
https://www.itsvse.com/thread-9543-1-1.html

Java Geolocation Information in ElasticSearch (geo_point)
https://www.itsvse.com/thread-6444-1-1.html

ElasticsearchParseException[field must be either [lat], [lon] or [geohash]]
https://www.itsvse.com/thread-6442-1-1.html

elasticsearch-mappingfield type
https://www.itsvse.com/thread-6436-1-1.html

Elasticsearch: Solution for "No handler for type [string] declared on field [XX]"
https://www.itsvse.com/thread-6420-1-1.html

【Practical Action】Kibana installation tutorial for Elasticsearch
https://www.itsvse.com/thread-6400-1-1.html

Geo geographic coordinates of the Elasticsearch advanced feature family
https://www.itsvse.com/thread-6399-1-1.html

ElasticSearch compound queries must, should, must_not use
https://www.itsvse.com/thread-6334-1-1.html

Elasticsearch deletes and indexes all document data
https://www.itsvse.com/thread-6321-1-1.html

[Actual combat]. net/c# Call elasticsearch search via NEST [with source code]
https://www.itsvse.com/thread-6294-1-1.html

Causes and solutions for unassigned_shards single-node Elasticsearch
https://www.itsvse.com/thread-6193-1-1.html

Tutorial on installing elasticsearch-analysis-ik in elasticsearch-6.5.2
https://www.itsvse.com/thread-6191-1-1.html

Install the elasticsearch-6.5.2 elasticsearch-head plugin
https://www.itsvse.com/thread-6190-1-1.html

Centos 7 installation and deployment elasticsearch-6.5.2 tutorial
https://www.itsvse.com/thread-6173-1-1.html

Search engine selection: Elasticsearch vs Solr
https://www.itsvse.com/thread-6168-1-1.html

Little scum · Posted on 11/5/2023 9:27:45 PM

Search test address: https://www.itsvse.com/blog_xzz.html
