[Space] A misplaced space can cause a major outage: a case study



System operation and maintenance has always been delicate work. Beyond the constraints of rules and procedures, the rigor and caution of the operations staff are essential. Sometimes a mistake as small as a single character, or a single space, leads to disaster.
In this case, an Oracle RAC cluster suffered instance restarts, one of which failed, because of a single space.

Symptom: the customer's Oracle 10.2.0.4 RAC environment on Solaris 10 suddenly experienced instance restarts.
Timeline: the database ran normally until about 3 p.m. The two nodes then each restarted, and the instance on one of them failed to start automatically. A review of the alert logs of both instances showed that ORA-27504 errors were raised on both nodes before the restarts.
Error message:

ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 not found. Check output from ifconfig command

The error message is clear: the requested IP address no longer exists on the host, so the next step is to check the output of the ifconfig command.
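The check that the error message asks for can be scripted. The following is a minimal sketch, assuming the output of ifconfig -a is piped in on standard input; has_ip is a hypothetical helper name, not part of Oracle or the operating system.

```shell
#!/bin/sh
# has_ip ADDRESS: read `ifconfig -a` output on stdin and succeed only if
# ADDRESS appears as a whole token, so that 192.168.168.3 does not match
# 192.168.168.30. Hypothetical helper for illustration only.
has_ip() {
    grep -F -w -q -- "$1"
}

# Intended use on the affected node (address taken from the ORA-27303 message):
#   ifconfig -a | has_ip 192.168.168.3 || echo "interconnect address missing"
```

Reading from standard input keeps the helper independent of the platform's ifconfig output format, since it only looks for the address itself.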

Next came the IPC send timeout:

Wed Apr 10 15:08:13 2013
ospid 25678: network interface with IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 not found. Check output from ifconfig command
Wed Apr 10 15:08:16 2013
IPC Send timeout detected.Sender: ospid 25748
Receiver: inst 2 binc 430164 ospid 11890

Eviction of the instance was then inevitable:

Wed Apr 10 15:16:40 2013
Waiting for instances to leave:
2

The cause is easy to work out from these messages. The IP address on node 2 had been changed, which broke the interconnect heartbeat. Node 1 tried to evict node 2 from the cluster but could not communicate with it, so it had to wait for node 2 to restart.

The operating system log on node 2 contained the following key entries:

Apr 10 15:00:04 ip: [ID 482227 kern.notice] ip_arp_done: init failed
Had[4135]: [ID 702911 daemon.notice] VCS CRITICAL
CPU usage on bj-sst is 92%
sshd[13485]: error: Failed to allocate internet-domain X11 display socket.

The ip_arp_done: init failed message appeared at 15:00:04. It indicates that host name information was used while a network interface was being configured, in other words that the host's IP address was modified online.

Finally, the shell history showed that someone had logged in to the system as root:

They intended to run ifconfig -a6 to check the IPv6 addresses, but mistyped the command.
What was actually executed was ifconfig -a 6, with an extra space between a and 6.
As a result, all IP addresses on the host were set to 0.0.0.0.

That was the root cause of the whole failure: one space brought down the entire cluster in an instant.

The lesson is that every operation, down to the level of a single command, demands care, especially from privileged users such as DBA users and root.
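One way to act on this lesson is to check the arguments before a privileged command is run. The following is a minimal sketch of such a guard, not something the original team used. check_ifconfig_args is a hypothetical name, and its rule is deliberately conservative: it assumes, as this incident suggests, that anything typed after an all-interfaces option such as -a is applied to every interface. It only inspects the arguments and never runs ifconfig.

```shell
#!/bin/sh
# check_ifconfig_args ARGS...: inspect the arguments meant for ifconfig
# WITHOUT running ifconfig. Hypothetical, deliberately conservative policy:
# an option that selects all interfaces (-a, -a4, -a6, ...) must be the only
# argument, because anything after it may be applied to every interface.
check_ifconfig_args() {
    case "${1-}" in
        -a*)
            if [ "$#" -gt 1 ]; then
                echo "refuse: '$2' after '$1' would be applied to all interfaces"
                return 1
            fi
            ;;
    esac
    echo "ok"
}

check_ifconfig_args -a6     # prints: ok
check_ifconfig_args -a 6    # prints: refuse: '6' after '-a' would be applied to all interfaces
```

A wrapper like this would also reject legitimate uses such as changing a setting on all interfaces at once, which is an acceptable trade-off only if such changes are rare and can go through a separate, reviewed procedure.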

While we are at it, here is a review of the ifconfig command:
The ifconfig command configures and displays the network parameters of network interfaces in the Linux kernel. Settings made with ifconfig are lost when the network interface or the machine is restarted. To make them permanent, you have to edit the network interface's configuration file.
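Where that configuration file lives depends on the distribution, and the article does not name one. As an illustration only, assuming a Red Hat style layout, a static configuration for eth0 might look like the sketch below. The file path is part of that assumption and the address is an example value.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0  (Red Hat style layout; an assumption)
# Interface this file configures
DEVICE=eth0
# Static configuration, no DHCP
BOOTPROTO=none
# Bring the interface up at boot
ONBOOT=yes
# Example address and netmask
IPADDR=192.168.2.10
NETMASK=255.255.255.0
```

Other systems keep the equivalent settings elsewhere, so check the platform's own documentation before editing anything.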

Syntax
ifconfig (parameters)

Parameters
add <address>: Set an IPv6 address on the network device;
del <address>: Delete an IPv6 address from the network device;
down: Shut down the specified network device;
hw <network device type> <hardware address>: Set the type and hardware address of the network device;
io_addr <I/O address>: Set the I/O address of the network device;
irq <IRQ number>: Set the IRQ of the network device;
media <network media type>: Set the media type of the network device;
mem_start <memory address>: Set the start address of the main memory region used by the network device;
metric <number>: Set the value added when computing the number of hops a packet is forwarded;
mtu <bytes>: Set the MTU of the network device;
netmask <subnet mask>: Set the subnet mask of the network device;
tunnel <address>: Set up the tunnel address for communication between IPv4 and IPv6;
up: Activate the specified network device;
-broadcast <address>: Treat packets sent to the specified address as broadcast packets;
-pointopoint <address>: Establish a direct point-to-point link with the network device at the specified address; such a link is private to the two endpoints;
-promisc: Turn promiscuous mode off or on for the specified network device;
IP address: The IP address to assign to the network device;
network device: The name of the network device.

Explanation:
eth0 is the first network card. HWaddr is the card's physical (MAC) address; here it is 00:16:3E:00:1E:51.
inet addr is the card's IP address. Here the IP address is 10.160.7.81, the broadcast address is 10.160.15.255, and the netmask is 255.255.240.0.
lo is the host's loopback interface. It is typically used to test a network program that should not be visible to users on the LAN or an external network and should only be reachable from the host itself. For example, if you bind the httpd server to the loopback address, you can enter 127.0.0.1 in a browser on the same host to see the web site, but no other host or user on the LAN can reach it.
Line 1: connection type (Ethernet) and HWaddr (the hardware MAC address).
Line 2: the card's IP address, broadcast address, and netmask.
Line 3: UP (the interface is enabled), RUNNING (the cable is connected), MULTICAST (multicast is supported), MTU:1500 (maximum transmission unit: 1500 bytes).
Lines 4 and 5: statistics for received and transmitted packets.
Line 7: statistics for received and transmitted bytes.
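The fields described above can also be pulled out with a small script. The following is a sketch, assuming the older net-tools output format that uses the HWaddr, inet addr: and Mask: labels and indents continuation lines with spaces. summarize_ifconfig is a hypothetical helper name, and the sample input is reconstructed from the values quoted in the explanation, not a real capture.

```shell
#!/bin/sh
# summarize_ifconfig: read net-tools style ifconfig output on stdin and print
# one "name mac ip mask" line for every interface that has an IPv4 address.
# Interfaces without a hardware address (such as lo) get "-" as the MAC.
summarize_ifconfig() {
    awk '
        # A new interface block starts in column 1
        /^[^ ]/  { name = $1; mac = "" }
        /HWaddr/ { for (i = 1; i < NF; i++) if ($i == "HWaddr") mac = $(i + 1) }
        /inet addr:/ {
            ip = ""; mask = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^addr:/) ip = substr($i, 6)
                if ($i ~ /^Mask:/) mask = substr($i, 6)
            }
            print name, (mac == "" ? "-" : mac), ip, mask
        }
    '
}

# Sample input rebuilt from the values in the explanation (illustrative only)
summarize_ifconfig <<'EOF'
eth0      Link encap:Ethernet  HWaddr 00:16:3E:00:1E:51
          inet addr:10.160.7.81  Bcast:10.160.15.255  Mask:255.255.240.0
          UP RUNNING MULTICAST  MTU:1500
EOF
# prints: eth0 00:16:3E:00:1E:51 10.160.7.81 255.255.240.0
```

Newer versions of ifconfig print a different layout, so the patterns would need adjusting there.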
Bring a network card up or down:
ifconfig eth0 up
ifconfig eth0 down
ifconfig eth0 up enables the network card eth0, and ifconfig eth0 down disables it. Be careful when you are logged in to a Linux server over SSH: once the card is down you cannot bring it back up over that connection, unless the server has more than one network card.
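A script can at least warn before an interface is taken down from a remote session. The sketch below relies on the SSH_CONNECTION environment variable, which OpenSSH sets for remote logins. down_is_risky is a hypothetical helper name, and the snippet only prints a warning; it never touches an interface.

```shell
#!/bin/sh
# down_is_risky: succeed (exit 0) when the current shell looks like an SSH
# session, in which case "ifconfig <iface> down" may cut off the session.
down_is_risky() {
    [ -n "${SSH_CONNECTION-}" ]
}

# Usage sketch: warn only, do not change any interface here.
if down_is_risky; then
    echo "warning: logged in over SSH; bringing the interface down may lock you out"
fi
```

The check cannot tell which interface carries the SSH session, so it errs on the side of warning whenever the login is remote.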

Configure and remove IPv6 addresses for a network card:
ifconfig eth0 add 3ffe:3240:800:1005::2/64
Configures an IPv6 address on the network card eth0
ifconfig eth0 del 3ffe:3240:800:1005::2/64
Removes the IPv6 address from the network card eth0

Change the MAC address with ifconfig:
ifconfig eth0 hw ether 00:AA:BB:CC:DD:EE

Configure IP address:
[root@localhost ~]# ifconfig eth0 192.168.2.10
[root@localhost ~]# ifconfig eth0 192.168.2.10 netmask 255.255.255.0
[root@localhost ~]# ifconfig eth0 192.168.2.10 netmask 255.255.255.0 broadcast 192.168.2.255
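The broadcast address in the last command follows from the IP address and the netmask: each octet is the IP octet ORed with the inverted netmask octet. The following sketch does that arithmetic; broadcast_of is a hypothetical helper name, and it assumes well-formed dotted-quad input.

```shell
#!/bin/sh
# broadcast_of IP NETMASK: print the IPv4 broadcast address, computed octet
# by octet as  ip OR (255 - netmask).  Runs in a subshell so that the IFS
# change does not leak into the caller.
broadcast_of() (
    IFS=.
    # Split both dotted quads into eight positional parameters
    set -- $1 $2
    echo "$(( $1 | (255 - $5) )).$(( $2 | (255 - $6) )).$(( $3 | (255 - $7) )).$(( $4 | (255 - $8) ))"
)

broadcast_of 192.168.2.10 255.255.255.0    # prints: 192.168.2.255
broadcast_of 10.160.7.81 255.255.240.0     # prints: 10.160.15.255
```

The second call reproduces the broadcast address quoted for eth0 in the explanation of the ifconfig output.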

Enable and disable ARP:
ifconfig eth0 arp # enable ARP on network card eth0
ifconfig eth0 -arp # disable ARP on network card eth0

Set the maximum transmission unit:
ifconfig eth0 mtu 1500 # set the maximum packet size that can pass to 1500 bytes

Source: compiled from the “Data and Cloud” public account and other sources
