Is operations and maintenance really the least skilled position in the entire IT industry?

Operations and maintenance (O&M) has always been a widely misunderstood position in the Internet industry, so much so that many people believe O&M is the lowest-skilled job in IT. That is not actually the case.

Essentially, O&M is a position in which you use your own technical knowledge to keep the IT services you manage up and running.

The same is true in business. While software engineers are tasked with making software available to users in graphical form by writing code, operations engineers are tasked with making the software work properly on a computer or system. But in the event of a problem with the software, most people want to go to the software engineer, not the operations engineer.

It's like building a house. Product managers are responsible for planning the house, designers for its exterior design, development engineers for building it, and operations and maintenance for laying its foundation. Laying a good foundation doesn't mean simply digging a hole. There is a lot of technology involved: the size, depth, and moisture of the pit must all be thoroughly studied.

Once the house is built, people only pay attention to its style. Few people pay attention to the foundation, but once the house collapses, everyone asks whether the foundation was strong, and that is when O&M gets called out to take the blame.

Many people one-sidedly think that O&M has no technical content. This is a misunderstanding, because O&M is divided into many levels, and it depends on which level you have reached. Today, if an O&M engineer masters cloud computing and a programming language on top of the basics (Python is the best fit for O&M staff), that engineer is already at a high level and is basically a full-stack DevOps engineer. People like this do not have to worry about finding a job, and their salary is naturally higher than that of ordinary O&M staff.

I've been in both large and small companies myself. I think the main thing is that there are too many junior ops, and they do a lot of things that can't even be called ops. To summarize:

Ops inevitably does the basics, such as deploying services, going live, and even moving machines and reinstalling systems. But O&M can't consist of only this, so how you spend the remaining time on things that improve your O&M skills is especially important.

A simple example: when you work alongside R&D, what is your position, and how do you show your value and technical skills? If you can't answer that, you are basically just helping someone else.

The broad scope includes: hardware, network, operating system, database, storage, open source software; responsibilities: deployment and debugging of various functions, such as ldap, samba, nagios, etc.; further refinement of the division of labor also includes: stress testing, performance optimization, kernel parameter tuning, system problem tracking, and so on.

Many O&M engineers do too many things at too many different levels. As a result, much of the work is done just to complete the task, without in-depth research. Of course, they may also lack scenarios that call for in-depth study.

This is closely related to the first point: because the goals themselves are not well planned and the work is not summarized enough, the technical improvement is relatively limited.

To give a real example, I know someone who has been doing operations and maintenance for more than 7 years. During that time, he did a lot of things at a couple of companies, staying a decent amount of time at each. Normally, that should add up to a fair amount of accumulated experience. A while back, when I was about to give him an internal referral, I looked at his resume. I had a few impressions: the entire resume was descriptive vocabulary with no data to support it; the project work was all narrative description, full of service builds and problem solving, with no technical points; and the only technical work was a one-off, with no discussion of the options considered, so his technical level could not be seen.

I've interviewed a lot of people myself, and to be honest, this kind of resume is far from a passing grade. How can a company receive a resume like this and quickly realize that you are the person it needs?

If we don't know the specifics of O&M, we don't have the right to evaluate the technical content of O&M. Generally speaking, the content of the operation and maintenance of Internet companies is divided into two levels:

Simply put, it is deploying services, repairing computers, installing systems, installing software, dealing with network problems, and so on, and doing all sorts of chores, and even getting a router and cutting network cables.

Network operations and maintenance, that is, network engineering: you must be proficient in a variety of network protocols and architectures, and in routing and switching on at least two of Cisco, Huawei, and H3C;

Database operations and maintenance, which should be understood as the DBA role: you must be proficient in databases;

Operating system operations and maintenance: you must be proficient in the operating system, understand its internal working principles, know some hardware, and understand network protocols for troubleshooting;

There are many others, such as server operations and maintenance, and all of them need to cover a wide range of technologies at the same time.

When O&M looks technically weak, it may simply be because the company is small. In a small company, O&M work can only be superficial and basic, and many of these O&M jobs are now being replaced by cloud services. What O&M does there is run software on a cloud platform.

In fact, some people think it is very simple to operate the software on the platform, but in fact, without the accumulation of computer-related knowledge, it is difficult to know the implementation of functions on the cloud platform. In this regard, the technical content is not low.

If the company gradually grows into a large one, the value of operations and maintenance becomes clear. Managing cloud and offline resources, databases, and networks, and scheduling and handling compute and network load, all require solid theoretical knowledge of computing and practical experience. Otherwise it is not possible to provide stable, reliable services to the layers above.

For a company that provides Internet services, users being able to use them stably and reliably is the foundation of its survival. Imagine a company whose service fails and becomes unavailable every three days. That would certainly make O&M's presence felt, but would people still trust your product?

Ops function:

First, BAT (Baidu, Alibaba, Tencent) has a more granular division of labor in Ops. Typically, system, database, and application O&M are completely separated. As a result, each role is likely to be more focused, but the scope it covers is of course narrower.

In terms of job functions, O&M is centered around three main goals: availability, efficiency improvement, and cost control, which are closely related to company and R&D goals. Much of what Ops does is a breakdown of these three goals.

In terms of technical improvements, this is primarily in the form of projects that utilize service understanding and technical solutions to solve common problems.

Technical work:

Take service availability for example. It is not as simple as handling alerts, operating with care, and writing some automation tools.
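To make the availability goal concrete, here is a minimal Python sketch of the usual arithmetic: availability is the fraction of a period during which the service was up, and a target availability implies a downtime budget. The function names and the example figures are assumptions chosen for illustration, not something taken from a specific company's practice.

```python
def availability(total_seconds: float, downtime_seconds: float) -> float:
    """Return availability as a fraction between 0 and 1 for one period."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime_seconds must be within the period")
    return 1.0 - downtime_seconds / total_seconds


def allowed_downtime(total_seconds: float, target: float) -> float:
    """Return the downtime budget (seconds) implied by a target availability."""
    if not 0 <= target <= 1:
        raise ValueError("target must be between 0 and 1")
    return total_seconds * (1.0 - target)


if __name__ == "__main__":
    month = 30 * 24 * 3600  # a hypothetical 30-day month, in seconds
    # One hour of outage in the month.
    print(f"availability: {availability(month, 3600):.5f}")
    # Downtime budget for a hypothetical 99.9% target.
    print(f"budget: {allowed_downtime(month, 0.999):.0f} s")
```

Tracking a figure like this over time is one way to turn "handled some alerts" into data that shows whether availability work is actually paying off.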

In terms of how you work:

Strictly follow a set plan for organizing, reviewing, and summarizing your work. Are there clear rules for how the division of labor is carried out? What time granularity is the plan accurate to: quarter, month, week, or day? How often do you review?

Combining these aspects is what makes it possible for Ops engineers at BAT to improve technically so quickly. This is what I've seen.

A final word on the direction of Ops:

In order to have a bright future in Ops, a couple of elements are needed:

You need at least a business that is already growing and has a certain number of machines. It doesn't have to be BAT; choose what's right for you.

A lot of people don't like to deal with problems and only think about doing high-end, impressive work. I hate to tell you the result: the work is not grounded, and they build things that nobody uses.

So I think an ops architect has to be someone who understands the business and knows it very well. I've met people like that. They are at a very high level and usually don't handle day-to-day issues, but at critical moments (e.g. when there is an incident) they can quickly find the key points and solve them, in some cases understanding even more of the details than you do. You have to admire them. Ops must be such a person!

It doesn't matter if you spend every day repeatedly going online, troubleshooting issues, responding to requests, and developing maintenance scripts. The key is whether you have seen the pain points in the business and in O&M from the problems you have handled, and whether you use existing technical solutions to address them!

Solving a lot of problems does not by itself make you a great engineer. The point is to solve problems in a way that also reflects your overall perspective and technical skills.

To take the simplest example, a machine's disk is almost full. This is a particularly small problem, and operations engineers run into it all the time.

If you just check disk usage and then delete some data or tweak the cleanup script, that is the lowest level of handling. If you check disk usage, confirm whether the problem affects a single machine or a whole batch, work out why the alert fired at this moment, and confirm that it can be resolved cleanly, that is a much higher level. Higher still: you dig into the disk usage, find the root cause of the growth, and discover that the growth is uncontrollable and that the existing deletion methods cannot prevent the alert. You then ask whether there is a way to keep the disk from alarming while important data is still retained normally, and you solve it with a technical solution. There are many examples like this.
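As a rough illustration of moving beyond "check and delete" in the disk example, the following Python sketch measures current usage and estimates how fast the disk is growing and when it would fill up. The `DiskSample` type, the function names, and the idea of feeding in periodic samples are hypothetical choices for this sketch; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DiskSample:
    """One observation of used bytes at a point in time (in seconds)."""
    timestamp: float
    used_bytes: int


def usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def growth_rate(samples: Sequence[DiskSample]) -> float:
    """Average growth in bytes per second between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    first, last = samples[0], samples[-1]
    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (last.used_bytes - first.used_bytes) / elapsed


def seconds_until_full(samples: Sequence[DiskSample],
                       total_bytes: int) -> Optional[float]:
    """Estimate time until the disk is full; None if usage is not growing."""
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    return (total_bytes - samples[-1].used_bytes) / rate
```

With an estimate like this, the question shifts from "how do I silence this alert" to "is the growth expected, and does the retention policy need to change", which is the higher level of handling described above.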

You'll find that Ops is really about leveraging your familiarity with systems, networks, hardware, specifications, and services, combined with specialized knowledge, to solve with technical solutions a series of common problems that R&D and testing can't or won't solve. That work can lead to tools, platforms, and frameworks that ultimately create value for the O&M department and even the company. That is great operations and maintenance.

So it comes back to the same sentence: there is no low-tech position; it all depends on how you do the job.

As times change, many of the problems we face with any technology can now be solved by cloud computing, with corresponding products and solutions available, and cloud computing has also had a certain impact on operations and maintenance. New trends are emerging from this.

The first is the move from IOE (IBM, Oracle, EMC) to open source and x86. De-IOE has actually been under way for some time. Why remove IOE? The year 2008 left a deep impression across the whole network. At that time, security was gradually being raised to the national level. That is the first reason. In addition, China's domestic environment was changing rapidly, and the demand for localization and independent R&D capability kept growing, with an emphasis on building strong in-house capability. Industries also wanted flexible control over their own architecture, at both the national and the company level. This demand for localization is the second reason for removing IOE. In the long run, IOE and non-IOE architectures will coexist for a long time, because upgrading a technical system cannot be done in a day or two, especially for core databases, core applications, and core systems, which back then were often deployed on IOE architecture.

The second is O&M automation and intelligence. This has been talked about for several years; it has been about five or six years since I first started practicing it, and it is still being talked about. In fact, many industries have been iteratively improving O&M automation and intelligence, and it really does bring a lot of benefits to O&M.

The third is bimodal IT O&M. During the transformation from traditional IT to the Internet and mobile, O&M must on the one hand keep the existing business running and on the other hand adapt to the changes brought by new IT technology.

The fourth is the convergence of R&D and operations, i.e., DevOps. DevOps has become widely known in the past two to three years. Its core concepts include Lean and Agile, supported by continuous integration and continuous delivery tool chains and some lightweight IT service management. Based on these concepts and tools, a whole process from R&D to operations is formed. IT operations become more efficient and more iterative, with faster feedback, and better meet internal business needs and user needs. This is the value of integrating R&D and operations.

The fifth is the integration of cloud resources to provide a larger platform that supports big data, AI, and the operations of connected systems across all industries. This is also a major trend in the connected world. It is both a challenge and an opportunity for O&M. Why? Because the industry and the technology are constantly changing, and as long as we change with the general trend, we stay with the times.

If we stay stuck in the old concept of operations and maintenance, not moving to the cloud and not touching the cloud, we will certainly be eliminated. Ten years ago it was difficult to deploy a database, with all kinds of configuration and tuning; now you can directly open an RDS instance, and optimization and clustering are already done. In terms of efficiency and stability, it reaches the level of our traditional O&M in minutes. This is the general trend that O&M has to face.

Based on this, the concept of cloud native has become more popular in the past year or two. In fact, it is a deeper and broader integration of the existing cloud architecture technology stack. It uses the concepts of DevOps, microservices, and Agile, and adopts ideas such as the middle platform or open platforms, to build and reshape the technology system so that it better supports rapid, iterative development of new business. It has a lot in common with DevOps.

The sixth is digitization. This has been a hot topic in China for the last two years, and with good reason. We used to build all kinds of information systems and platforms, but in doing so we often also built a lot of barriers, which left many of our information systems unable to work together and fragmented both the business and the organization. The problem that digitization is trying to solve is to build new services on the underlying data and algorithms in order to bridge our business.

Having covered so many trends in broad terms, they largely point in the same direction. It used to be hardware, and now it is software-defined. It used to be servers, and now it is the cloud; we are using the cloud now, and in the future it will probably be more about hybrid cloud and cloud integration. It used to be pure technical O&M, and now it is integrated technical O&M. In addition, and equally important, no matter what we do now, cyberspace security has been elevated to the national level, and within enterprises it has also been raised to the highest priority. Cybersecurity is now standard for IT.