

[Source] A few essentials you must know when using MySQL to handle more than a million rows of data

Posted on 5/11/2018 1:57:06 PM
In testing, a conditional query against a table containing more than 4 million records took as long as 40 seconds, so knowing how to improve the efficiency of SQL queries matters a great deal. The following are several query optimization methods that are widely circulated on the Internet:
    First of all, when the data volume is large, try to avoid full table scans: consider building indexes on the columns used in where and order by, which can greatly speed up data retrieval. However, there are situations where an index will not be used:

1. Try to avoid using the != or <> operators in the where clause, otherwise the engine will abandon the index and perform a full table scan.
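     For example (a sketch assuming num is an indexed column that contains no NULL values), a condition such as num<>10 can often be rewritten as two ranges joined with union all, in the same spirit as the OR rewrite in point 3:
     select id from t where num<10
     union all
     select id from t where num>10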

2. Try to avoid testing fields for NULL in the where clause, otherwise the engine will abandon the index and perform a full table scan, such as:
     select id from t where num is null
     You can set a default value of 0 on num so that the num column contains no NULL values, and then query like this:
     select id from t where num=0

3. Try to avoid using OR to join conditions in the where clause, otherwise the engine will abandon the index and perform a full table scan, such as:
     select id from t where num=10 or num=20
     You can query like this:
     select id from t where num=10
     union all
     select id from t where num=20

4. The following query will also result in a full table scan:

    select id from t where name like '%abc%'

    To improve efficiency, consider full-text search.
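    As a rough sketch of the full-text approach in MySQL (assuming the storage engine supports FULLTEXT indexes; the index name ft_name is illustrative). Keep in mind that full-text search matches whole words rather than arbitrary substrings, so it is not a drop-in replacement for like '%abc%':

    create fulltext index ft_name on t(name);
    select id from t where match(name) against('abc');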

5. IN and NOT IN should also be used with caution, otherwise they will lead to a full table scan, such as:
     select id from t where num in(1,2,3)
     For continuous values, if you can use between, don't use in:
     select id from t where num between 1 and 3

6. Using a parameter in the where clause will also cause a full table scan. SQL resolves local variables only at runtime, but the optimizer cannot defer the choice of access plan to runtime; it must choose at compile time. If the access plan is built at compile time, the value of the variable is still unknown and therefore cannot be used as an input for index selection. The following statement will cause a full table scan:
     select id from t where num=@num
     You can force the query to use an index instead:
     select id from t with(index(index_name)) where num=@num
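     Note that with(index(...)) is SQL Server hint syntax; if the table lives in MySQL, a rough equivalent (assuming an index named idx_num exists on num) would be:
     select id from t force index(idx_num) where num=@num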

7. Try to avoid performing expression operations on fields in the where clause, which will cause the engine to abandon the index and perform a full table scan. For example:
     select id from t where num/2=100
     should be changed to:
     select id from t where num=100*2

8. Try to avoid performing function operations on fields in the where clause, which will cause the engine to abandon the index and perform a full table scan. For example:
     select id from t where substring(name,1,3)='abc'                  -- ids whose name starts with 'abc'
     select id from t where datediff(day,createdate,'2005-11-30')=0    -- ids created on '2005-11-30'
     should be changed to:
     select id from t where name like 'abc%'
     select id from t where createdate>='2005-11-30' and createdate<'2005-12-1'

9. Do not apply functions, arithmetic operations, or other expressions to the left side of the "=" in the where clause, otherwise the system may not be able to use the index correctly.
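     For example (a sketch assuming num is indexed), keep the indexed column bare on the left-hand side of the comparison:
     select id from t where num+1=101
     should be changed to:
     select id from t where num=100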

10. When using an indexed field as a condition, if the index is a composite index, the first (leading) column of the index must appear in the condition for the system to use the index; otherwise the index will not be used. The order of fields in the condition should also match the index order as closely as possible.
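     A sketch, assuming a composite index is built on (num,name); the index name idx_num_name is illustrative:
     create index idx_num_name on t(num,name)
     select id from t where num=10 and name='abc'      -- leading column present, the index can be used
     select id from t where name='abc'                 -- leading column missing, the index is not used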

11. Don't write meaningless queries, such as one used only to generate an empty table structure:
     select col1,col2 into #t from t where 1=0
     This type of code does not return any result set, but it consumes system resources, so it should be changed to something like this:
     create table #t(…)

12. Many times it is a good choice to use exists instead of in:
     select num from a where num in(select num from b)
     Replace with the following statement:
     select num from a where exists(select 1 from b where num=a.num)


Things to pay attention to when building an index:

1. Not all indexes are effective for queries. SQL optimizes queries based on the data in the table, and when an indexed column contains a large amount of duplicated data, a query may not use the index at all. For example, if a table has a sex field whose values are roughly half male and half female, then even if an index is built on sex, it will do nothing for query efficiency.
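     To get a rough idea of how much duplication an indexed column has, compare the number of distinct values to the number of rows (a sketch; the sex column and table t are illustrative). If distinct_values is tiny compared with total_rows, an index on that column will help little:
     select count(distinct sex) as distinct_values, count(*) as total_rows from t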

2. More indexes are not always better. An index certainly improves the efficiency of the corresponding select, but it also reduces the efficiency of insert and update, because the indexes may have to be rebuilt on insert or update. How to build indexes therefore needs careful consideration of the specific situation. It is best not to have more than 6 indexes on a table; if there are more, consider whether indexes on rarely used columns are really necessary.

3. Avoid updating clustered index columns as much as possible, because the order of the clustered index columns is the physical storage order of the table's records; once a column value changes, the order of the entire table's records has to be adjusted, which consumes considerable resources. If the application needs to update clustered index columns frequently, reconsider whether the index should be clustered at all.


Other points to note:

1. Use numeric fields wherever possible; try not to design fields that contain only numeric information as character types, which reduces the performance of queries and joins and increases storage overhead. The engine compares a string character by character when processing queries and joins, whereas a numeric type needs only a single comparison.
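     A small illustrative sketch (the orders table and order_no column are hypothetical): prefer an integer column over a character column for purely numeric data, e.g.
     create table orders (order_no int)
     rather than
     create table orders (order_no varchar(20))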

2. Do not use select * from t anywhere, replace "*" with a specific field list, and do not return any fields that are not used.
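     For example, if only id and name are needed (the column names are illustrative):
     select id,name from t
     instead of
     select * from t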

3. Use table variables instead of temporary tables where possible. Note that if a table variable holds a large amount of data, its indexing is very limited (only the primary key index).
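     A minimal T-SQL sketch of a table variable (the name @small and its columns are illustrative); only the primary key declared inline acts as its index:
     declare @small table (id int primary key, name varchar(50))
     insert into @small(id,name) select id,name from t where num=10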

4. Avoid frequently creating and deleting temporary tables to reduce the consumption of system table resources.

5. Temporary tables are not off-limits; using them appropriately can make certain routines more effective, for example when you need to repeatedly reference a large table or a dataset from a commonly used table. However, for one-off events, it is better to use an export table.

6. When creating a temporary table, if a large amount of data is inserted at once, use select into instead of create table, to avoid generating a large amount of log and to improve speed; if the amount of data is small, then to ease the load on the system tables, create the table first and then insert.
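     A sketch of the two approaches (the temporary table #t and its columns are illustrative):
     -- large data volume: select into creates and fills #t in one step
     select id,name into #t from t where num=10
     -- small data volume: create first, then insert
     create table #t(id int, name varchar(50))
     insert into #t(id,name) select id,name from t where num=10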

7. If temporary tables are used, be sure to explicitly delete all of them at the end of the stored procedure: truncate table first and then drop table, to avoid long locks on the system tables.
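     For example, at the end of the stored procedure (assuming a temporary table named #t was created earlier):
     truncate table #t
     drop table #t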

8. Try to avoid using cursors, because cursors are inefficient; if the data a cursor operates on exceeds 10,000 rows, consider rewriting the logic.

9. Before resorting to a cursor-based or temporary-table method, first look for a set-based solution to the problem; the set-based approach is usually more effective.

10. Like temporary tables, cursors are not off-limits. Using a FAST_FORWARD cursor on a small dataset is often better than other row-by-row processing methods, especially when several tables must be referenced to get the data you need. Routines that compute "totals" in the result set are usually faster than doing the same work with a cursor. If development time permits, try both the cursor-based and the set-based approach and see which works better.

11. Set SET NOCOUNT ON at the beginning of all stored procedures and triggers and SET NOCOUNT OFF at the end, so that a DONE_IN_PROC message does not have to be sent to the client after every statement in the stored procedure or trigger.
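     A minimal T-SQL skeleton (the procedure name usp_demo and its body are illustrative):
     create procedure usp_demo
     as
     begin
         set nocount on
         select id from t where num=10
         set nocount off
     end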

12. Try to avoid returning large amounts of data to the client; if the data volume is too large, consider whether the corresponding requirement is reasonable.

13. Try to avoid large transactions, in order to improve the system's concurrency.




Posted on 5/17/2018 10:12:27 AM
Thank you for sharing