IT operation and maintenance solutions

The IT operation and maintenance service system should be built in the order "easy to use, easy to summarize, easy to manage", tackling problems from the most serious to the least serious, so that the system can be put in place as quickly as possible.

The operation and maintenance service system consists of six parts: service policies, service processes, the service organization, the service team, the technical service platform, and the operation and maintenance objects. Together they cover four elements: policies, people, technology, and objects.

The operation and maintenance policies are the basic guarantee for standardized operation and maintenance management and the basis on which processes are established.

Following the policy requirements and the standardized processes, the personnel of the operation and maintenance organization use an advanced operation and maintenance management platform to carry out standardized management and technical operations on the various operation and maintenance objects.

IT fault location refers to the diagnosis of the direct cause or root cause of the fault. Fault location helps fault recovery actions to be more effective.

Fault location is usually the most time-consuming part of the entire fault process.

The goal of fault location is quick recovery, not tracing the problem to its source; tracing the source is the responsibility of problem management.

Normally, most availability failures are resolved by operation and maintenance experts making assumptions and judgments from experience, or by applying known solutions. Some failures, however, especially performance, usage-logic, and data failures, require multi-party collaboration and tool support.

In data centers, many technical operation and maintenance personnel often have a keen ability to detect known faults and can quickly find the root cause of the problem based on the faults they encounter.

More senior experts can draw on the system's internal principles to infer, from common fault symptoms, the likely causes behind a given phenomenon.

Judging possible diagnosis paths based on the symptoms of faults is an essential ability for an operation and maintenance technical expert, which is often accumulated through a large number of operation and maintenance cases.

This is where experts differ from ordinary operation and maintenance personnel.

Accurate data collection actually relies on operational knowledge.

For example, suppose a fault analysis needs data on CPU resource usage. How should that data be collected?

Should we take the average or the maximum CPU usage over a period of time and compare it with a threshold?

Is 100% CPU utilization necessarily a problem?

It's not that simple.

In fact, sudden CPU spikes are mostly harmless and may not have an adverse effect on our system.

Only when CPU utilization stays close to a high level over a long period is the CPU likely to become a resource bottleneck that affects system performance.
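The difference between a harmless spike and sustained load can be illustrated with a small sketch. It assumes CPU utilization is sampled at a fixed interval as a percentage and flags only the points where the average over the most recent samples stays high. The function name, the 85% threshold, and the window of 5 samples are illustrative assumptions, not values from any particular monitoring tool.

```python
from collections import deque
from typing import Iterable, List


def sustained_high_cpu(samples: Iterable[float],
                       threshold: float = 85.0,
                       window: int = 5) -> List[int]:
    """Return the sample indices at which the average of the last
    `window` samples is at or above `threshold` (in percent).

    threshold and window are assumed values chosen for illustration.
    """
    recent = deque(maxlen=window)  # keeps only the newest `window` samples
    alerts = []
    for i, value in enumerate(samples):
        recent.append(value)
        # Only judge once a full window of samples is available.
        if len(recent) == window and sum(recent) / window >= threshold:
            alerts.append(i)
    return alerts


# Two isolated spikes to 100% on an otherwise idle machine.
spiky = [20, 25, 100, 22, 18, 100, 21, 19]
# Utilization that climbs and then stays high.
busy = [50, 88, 92, 95, 90, 93, 97, 91]

print(sustained_high_cpu(spiky))  # -> []
print(sustained_high_cpu(busy))   # -> [5, 6, 7]
```

A plain maximum over the same period would report 100% for the spiky series and treat it as worse than the busy one, which is the misreading described above.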

1. Operation and Maintenance Processing Principles

During the operation of IT systems, problems or failures will inevitably occur.

Troubleshooting can be summarized in two principles: every measure or method gives priority to restoring business quickly, and faults are escalated in time.

1.1. Restoring business is a top priority

Business recovery priority means that no matter what level of failure occurs, under any circumstances, business should be restored first.

This is different from fault location. Many people misunderstand this and ask: if the root cause of the problem has not been found, how can the business be restored?

Here is a simple example: suppose system A fails when it calls system B. How do we find and solve the problem?

(1) From the server hosting A, ping B and check whether B's port is reachable over the network. If it is, bind A directly to the host of server B.

(2) Troubleshoot the path: find out which links the traffic passes through between A and B, including cross-server-area and cross-network-segment links, and identify the faulty one.

If the HA connection is abnormal, restart it or expand capacity to recover.

Usually, the first method takes less time.

If A accesses B across machine rooms, the second method will take even longer to check.

Although the first method breaks the architectural balance between A and B, it takes effect immediately. This is what we mean by prioritizing business recovery.
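The reachability check in method (1) can be sketched as a simple TCP connection attempt from the server that hosts A. This is a minimal illustration, not a full diagnostic tool: the function name and the 2-second timeout are assumptions, and a successful connection shows only that the port accepts connections, not that the service behind it is healthy.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds, False otherwise.

    The timeout default is an assumed value for illustration.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False
```

If the call returns True for server B's address and port, the direct binding described in method (1) is worth trying first; if it returns False, the link-by-link check of method (2) is needed.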

1.2. Timely escalation

This principle is easy to understand.

When a failure occurs, any one person can make only a rough prediction of its impact, so it is necessary to escalate to your leader in time so that they have first-hand information and can coordinate resources.

4. Security upgrade packages from large manufacturers or equipment or upgraded systems;

2. Operation and maintenance mode

Based on the operation and maintenance work requirements and response times, build a complete operation and maintenance plan and determine service standards.

On-site software and hardware inspections are the main way to enhance the execution of operation and maintenance plans.

Normally, the data center operation and maintenance workflow is as follows:

(1) Build a complete operation and maintenance plan: The plan is the core of the entire operation and maintenance workflow.

Following the principle of planning first, break the year's work plan down into per-item work plans and time-based plans, then carry them out and safeguard them according to the processes and plans.

(2) The importance of on-site inspection: The on-site inspection plan is the focus of the operation and maintenance work plan.

On-site inspections reveal the system's weak links, key business nodes, and hidden dangers. Formulating emergency plans and spare parts plans on that basis is particularly important.

(3) The importance of execution: The implementation of the operation and maintenance plan is the focus of the operation and maintenance work.

While the plan is being carried out, operation and maintenance work should strictly follow the process specifications, with controls in place to reduce operation and maintenance risks.

Users should receive regular feedback on how the operation and maintenance work is progressing.

(4) Operation and maintenance service standards: Sign an after-sales service commitment letter and agree on service levels with customers.

The promised service level, including the resources provided (spare parts, etc.) and the solutions provided, should be strictly implemented in accordance with the agreement.
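Whether an agreed response time is actually being met can be checked with a short script. The sketch below assumes that each service request records when it was opened and when it was first answered. The Ticket fields, the function name, and the 30-minute target in the usage lines are hypothetical and are not taken from any specific service agreement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Ticket:
    # Hypothetical record of one service request.
    ticket_id: str
    opened: datetime
    first_response: datetime


def sla_breaches(tickets: List[Ticket], target: timedelta) -> List[str]:
    """Return the IDs of tickets whose first response took longer
    than the agreed target."""
    return [t.ticket_id for t in tickets
            if t.first_response - t.opened > target]


# Example data: one request answered in 10 minutes, one in 45 minutes.
start = datetime(2024, 1, 1, 9, 0)
tickets = [
    Ticket("T1", start, start + timedelta(minutes=10)),
    Ticket("T2", start, start + timedelta(minutes=45)),
]
print(sla_breaches(tickets, timedelta(minutes=30)))  # -> ['T2']
```

The list of breaches can be included in the regular feedback to users, alongside the spare parts and solutions committed to in the agreement.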