

[Communication] IT application system failures are inevitable; timely detection lets you handle them calmly

Posted on 10/13/2014 10:36:01 AM
Intensive testing before an application system goes live can greatly reduce defects and hidden risks. However, a test environment can never exactly match the production environment, so testing cannot cover every scenario the system will meet in production, and failures in specific scenarios are hard to avoid.
Since the risk of failure cannot be eliminated, what matters is being able to handle faults calmly. Ideally we would know in advance: predict the problems an IT application system is likely to develop and act before they occur, nipping the fault in the bud. Failing that, we must at least learn what has gone wrong in the system, and where, as soon as possible, and deal with it before it spreads and the situation escalates. In reality both are still hard to achieve, which puts unprecedented pressure on operations and maintenance.
Consider enterprises with a high degree of IT adoption, with banks as the typical example. Their business depends more and more on IT, their IT applications grow ever more complex, and those applications become ever harder to control. The frustrating part is that, despite such intensive efforts to hem problems in, system failures still occur, risks surface again and again, and small problems often grow into major failures. Why? Why is discovery always late? Why can't the various monitoring methods detect anomalies the moment they occur? This deserves a closer look.
Broadly speaking, what a data center has to monitor falls into two categories: basic resources and IT application systems. Basic resources such as the network, hosts, storage, and machine-room temperature and humidity have long received great attention, and the means of monitoring them can fairly be described as "armed to the teeth".
For monitoring IT application systems, vendors and service providers at home and abroad offer many products and solutions, each with its own focus. Taken together, their approach is mainly to observe how an IT application system behaves at the basic resource layer: they use indicators such as network traffic, system performance, CPU utilization, memory usage, database access, and middleware status, combined with methods such as log analysis, probes, simulated access, and agent-based extraction, to obtain information about the running system at particular points in time, and from that roughly judge its overall operating status. These products and solutions lack continuous tracking of operational detail, so they cannot see the running state of each module inside the IT application system, let alone the function points under each module. Those details include: What transactions is the system processing? Which succeeded and which had problems? Who initiated each transaction, and when? What business was it for? Which module of the system was involved, and which function point handled it? When was the response returned? Were there any performance anomalies? If the transaction failed, what was the fault? These details are essential for judging the operating status of an IT application system.
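The questions in the paragraph above map naturally onto one record per transaction. The following is only a minimal sketch in Python, not something the original post specifies: the class name `TransactionRecord`, every field name, and the `is_slow` threshold check are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TransactionRecord:
    """One record per transaction. All names here are hypothetical."""
    initiator: str               # who initiated the transaction
    business_type: str           # what business it was for
    module: str                  # which module of the system was involved
    function_point: str          # which function point handled it
    started_at: datetime         # when it was initiated
    finished_at: datetime        # when the response was returned
    success: bool                # whether it succeeded
    fault: Optional[str] = None  # what the fault was, if it failed

    @property
    def elapsed_ms(self) -> float:
        """Response time in milliseconds."""
        return (self.finished_at - self.started_at).total_seconds() * 1000.0

    def is_slow(self, threshold_ms: float) -> bool:
        """Flag a performance anomaly against an assumed threshold."""
        return self.elapsed_ms > threshold_ms


# Example: a failed transfer that took 250 ms.
start = datetime(2014, 10, 13, 10, 36, 1)
record = TransactionRecord(
    initiator="teller-042",
    business_type="transfer",
    module="payments",
    function_point="submit_transfer",
    started_at=start,
    finished_at=start + timedelta(milliseconds=250),
    success=False,
    fault="downstream timeout",
)
```

Keeping the fault description and the timing on the same record as the module and function point is what lets a later query answer "which function point is failing, and how slowly" without joining against resource-layer metrics.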
In practice, in the early stage of an IT application system failure, the fault often has little impact on basic resources, or has not yet propagated to the basic resource layer, or occurs in a gap not covered by logs, probes, agents, and similar means. The risk is already an undercurrent in the system, yet existing monitoring methods cannot pick it up, and from the outside everything looks normal. This is the fundamental reason why fault detection lags and faults are hard to handle. Detecting system failures at the "first time" is therefore a weak point of current IT operations and maintenance, and remedying it is of great significance.
What does "first time" mean? It means that while an IT application system is responding to access requests, the moment a transaction fails or behaves abnormally, it must be captured accurately. Everyone knows that early detection allows timely handling. To reverse the current passive position of IT operations and make up for this weak point, the problem of detecting system failures at the "first time" has to be solved technically. Comparative study of, and practice with, a large number of running IT application systems shows that this is in fact technically feasible. But practitioners inside the field may be held back by habitual thinking: they do not step outside their existing mindset and may even assume it is infeasible, so no substantive breakthrough has been made and the operational risks of IT applications are still handled piecemeal and reactively.
The key to "first time" detection of system failures is to pay close attention to the IT application system and track its every move. Specifically, this means observing the operational detail of the IT application system in depth and placing every module and function point under strict monitoring. The monitoring must also be continuous and uninterrupted. Only then will no transaction anomaly be missed and the operation of the IT application system be kept under control.
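One way to picture monitoring every function point continuously is to wrap each one so that every call is recorded at the moment it returns or fails. The sketch below is an assumption-laden illustration in Python, not the method the post has in mind: the `monitored` decorator, the in-memory `EVENTS` list, the `slow_ms` threshold, and the `transfer` function are all hypothetical.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In-memory sink for illustration only; a real system would ship events elsewhere.
EVENTS: List[Dict[str, Any]] = []


def monitored(module: str, function_point: str, slow_ms: float = 1000.0) -> Callable:
    """Wrap a function point so every call is recorded as it completes or fails."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            event: Dict[str, Any] = {
                "module": module,
                "function_point": function_point,
                "success": True,
                "fault": None,
            }
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Capture the fault the moment it occurs, then let it propagate.
                event["success"] = False
                event["fault"] = f"{type(exc).__name__}: {exc}"
                raise
            finally:
                # Runs on both success and failure, so no call goes unrecorded.
                event["elapsed_ms"] = (time.perf_counter() - start) * 1000.0
                event["slow"] = event["elapsed_ms"] > slow_ms
                EVENTS.append(event)
        return wrapper
    return decorator


@monitored(module="payments", function_point="transfer")
def transfer(amount: float) -> str:
    """Hypothetical function point used only to exercise the decorator."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return "ok"
```

Because the event is appended in the `finally` clause, a failed call is recorded in the same instant it raises, rather than being discovered later through a sampled log or a resource-layer metric.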
Because this process captures and accumulates detailed information about the system's operating status, it builds up a valuable operating record for the system. Analyzing that record provides a reference for judging the quality of each module and function point, and a basis for analyzing how the system's operating status develops and changes, which makes it possible to predict the health trend of an IT application system.
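As a rough illustration of how an accumulated operating record might feed a health trend, the sketch below computes a failure rate per function point per time window and flags a function point whose rate has risen across the last three windows. This is a deliberately simple stand-in chosen for the example, not a technique described in the post; the record shape, the function names, and the three-window rule are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (window_index, function_point, success): an assumed, simplified record shape.
Record = Tuple[int, str, bool]


def failure_rates(records: Iterable[Record]) -> Dict[str, List[float]]:
    """Return failure rates per function point, ordered by window index.

    Windows in which a function point saw no calls are simply absent.
    """
    # counts[function_point][window] = [total_calls, failed_calls]
    counts: Dict[str, Dict[int, List[int]]] = defaultdict(
        lambda: defaultdict(lambda: [0, 0])
    )
    for window, function_point, success in records:
        bucket = counts[function_point][window]
        bucket[0] += 1
        if not success:
            bucket[1] += 1
    return {
        fp: [failed / total for _, (total, failed) in sorted(windows.items())]
        for fp, windows in counts.items()
    }


def is_degrading(rates: List[float]) -> bool:
    """True if the failure rate rose strictly across the last three windows."""
    return len(rates) >= 3 and rates[-3] < rates[-2] < rates[-1]
```

A function point whose failure rate moves 0.0, 0.25, 0.5 over three consecutive windows would be flagged, while one that stays flat would not. A production system would likely want smoothing and a minimum call count per window before trusting such a signal.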




