Tag Archives: Big data

Hive SQL syntax errors and their solutions

Hive SQL syntax differs from the MySQL syntax most people are used to. SQL written out of MySQL habit often reports errors whose cause is hard to see, so this post records the symptoms and the corresponding solutions.

Problem: ERROR: Error while compiling statement: FAILED: SemanticException [Error 10025]: Expression not in GROUP BY key 'id' (state=42000, code=10025)
Cause: a field appears in the SELECT list but not in the GROUP BY clause.
Solution: change "select id, name from a group by name" to "select collect_set(id), name from a group by name" (a HiveQL sketch follows after the configuration settings below).

Problem: ERROR: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:13 Invalid table alias or column reference 'id'
Cause: the corresponding field in the subquery has changed, for example by applying a function to it or renaming it.
Solution: change "select id, name from (select collect_set(id), name from a group by name) t" to "select id, name from (select collect_set(id) id, name from a group by name) t", or to "select t.id, name from (select collect_set(id) id, name from a group by name) t".

Problem: no data can be queried after a Hive multi-way UNION.
Cause: the data produced by the UNION is stored in HDFS in several new subdirectories under the table directory.
Solution: add the following configuration (it can be entered directly on the CLI):
set mapred.input.dir.recursive=true;
Alternatively, wrap the multiple UNION statements in an outer SELECT before executing them.

Problem: executing HQL on Tez reports an out-of-memory error.
Solution: increase the container size:
set hive.tez.container.size=4096;
set hive.tez.java.opts=-Xmx3072m;

Problem: Hive does not query subdirectories recursively by default, so if the directory specified when creating a table contains subdirectories, queries report "ERROR: not a file".
Solution: the following four settings in the Hive CLI enable recursive access to subdirectories. Note that this is not a selective recursive query: all data under the directory is loaded, so when the subdirectories are deep or numerous, queries can be very slow.
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
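
To make the collect_set fix above concrete, here is a minimal HiveQL sketch against a hypothetical table a(id, name):

-- Fails with Error 10025: id is selected but not grouped
-- select id, name from a group by name;

-- Works: aggregate id instead of selecting it raw, and give it an alias
select collect_set(id) id, name
from a
group by name;

-- An outer query can then reference the alias without Error 10004
select t.id, t.name
from (select collect_set(id) id, name from a group by name) t;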

Common errors and solutions in MapReduce stage

1) Importing the wrong package is an easy mistake, especially for Text and CombineTextInputFormat.
2) The first input parameter (the key) of a Mapper must be LongWritable or NullWritable, not IntWritable; otherwise a type conversion exception is reported.
3) java.lang.Exception: java.io.IOException: Illegal partition for 13926435656 (4) means the partition number does not match the number of reduce tasks, so adjust the number of reduce tasks.
4) If the number of partitions is not 1 but the number of reduce tasks is 1, is the partitioning step executed? The answer is no: in the MapTask source code, partitioning only runs when the number of reducers is greater than 1; if it is not greater than 1, partitioning is skipped.
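For point 2), a minimal mapper sketch, assuming a TextInputFormat-style job (the class name is illustrative, not from the original post):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat the input key is the byte offset of each line, so it must be
// declared LongWritable (or NullWritable), never IntWritable.
public class OffsetMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, one);
    }
}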
5) A jar compiled in a Windows environment is copied to a Linux environment and run with
hadoop jar wc.jar com.atguigu.mapreduce.wordcount.WordCountDriver /user/atguigu/ /user/atguigu/output
and the following error is reported:
Exception in thread "main" java.lang.UnsupportedClassVersionError:
com/atguigu/mapreduce/wordcount/WordCountDriver : Unsupported major.minor version 52.0
The reason is that the JDK versions of the Windows environment (where the jar was compiled) and the Linux environment (where it runs) differ; major.minor version 52.0 corresponds to JDK 1.8, so the jar was built with a newer JDK than the one running it.
Solution: use the same JDK version in both environments.
6) When caching pd.txt (the small-file join case), the job reports that the pd.txt file cannot be found.
Reason: most of the time the path is written incorrectly. Also check whether the file is actually named pd.txt.txt. On some machines a relative path cannot locate pd.txt; change it to an absolute path.
7) A type conversion exception is reported.
This is usually a mistake when setting the map output types and the final output types in the driver. If the key output by the map cannot be sorted, a type conversion exception is also reported.
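For point 7) (and the reduce-task count from point 3)), a minimal driver sketch under the same assumptions; it reuses the illustrative OffsetMapper from the sketch above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The four set*Output*Class calls must match what the mapper and reducer actually emit,
// otherwise a ClassCastException (type conversion exception) is thrown at runtime.
public class SketchDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sketch");
        job.setJarByClass(SketchDriver.class);
        job.setMapperClass(OffsetMapper.class);          // mapper from the sketch above
        job.setNumReduceTasks(4);                        // keep consistent with the partitioner (point 3)
        job.setMapOutputKeyClass(Text.class);            // mapper output key type
        job.setMapOutputValueClass(IntWritable.class);   // mapper output value type
        job.setOutputKeyClass(Text.class);               // final output key type
        job.setOutputValueClass(IntWritable.class);      // final output value type
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}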
8) When running wc.jar on the cluster, an error occurs: unable to get the input file.
Reason: the input file of the WordCount case cannot be placed in the root directory of the HDFS cluster.
9) The following related exception appears:

Exception in thread "main" java.lang.UnsatisfiedLinkError:
org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:356)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:371)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:364)
Solution: copy the hadoop.dll file into the Windows directory C:\Windows\System32. Some users also need to modify the Hadoop source code.
Scheme 2: create the same package name in your project (org.apache.hadoop.io.nativeio, matching the class in the exception) and copy NativeIO.java into it.
10) When customizing the output format, note that the close method of the RecordWriter must close the stream resources; otherwise the data in the output files is empty.
@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
    if (atguigufos != null) {
        atguigufos.close();
    }
    if (otherfos != null) {
        otherfos.close();
    }
}

How to save big data to a CLOB in Oracle

If the data is too large to be assigned directly to a CLOB variable, we can read and write the CLOB in segments using DBMS_LOB.read and DBMS_LOB.write. In that case the CLOB variable should be created as a cached temporary LOB, as shown in the first statement below. The example procedure below writes large data; by swapping the roles of DBMS_LOB.read and DBMS_LOB.write it can also be used to read it.

dbms_lob.createtemporary(lob_loc => x_clob,
                             cache   => TRUE);
PROCEDURE load_clob(p_clob_in IN CLOB,
                      x_clob    IN OUT NOCOPY CLOB) IS
    
    l_clob_len NUMBER := dbms_lob.getlength(p_clob_in);
    l_data VARCHAR2(32756);
    l_buf_len_std NUMBER := 4000;
    l_buf_len_cur NUMBER;
    l_seg_count   NUMBER;
    l_write_offset NUMBER;
  BEGIN
    IF p_clob_in IS NOT NULL THEN
      l_seg_count := floor(l_clob_len/l_buf_len_std);
      FOR i IN 0 .. l_seg_count
      LOOP
        
        IF i = l_seg_count THEN
          l_buf_len_cur := l_clob_len - i * l_buf_len_std;
        ELSE
          l_buf_len_cur := l_buf_len_std;
        END IF;
        
        IF l_buf_len_cur > 0 THEN
          dbms_lob.read(lob_loc => p_clob_in,
                        amount  => l_buf_len_cur,
                        offset  => i * l_buf_len_std + 1,
                        buffer  => l_data);
          l_write_offset := nvl(dbms_lob.getlength(lob_loc => x_clob),
                                0) + 1;
          dbms_lob.write(lob_loc => x_clob,
                         amount  => l_buf_len_cur,
                         offset  => l_write_offset,
                         buffer  => l_data);
        END IF;
      END LOOP;
    END IF;
  END load_clob;
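
A minimal usage sketch (assuming load_clob is compiled and accessible, and with illustrative variable names): create the cached temporary CLOB first, then call load_clob to append the source CLOB into it.

DECLARE
  l_src  CLOB := 'some very long text ...';  -- illustrative source CLOB
  l_dest CLOB;
BEGIN
  -- Create the destination as a cached temporary CLOB before the segmented read/write.
  dbms_lob.createtemporary(lob_loc => l_dest, cache => TRUE);
  load_clob(p_clob_in => l_src, x_clob => l_dest);
  dbms_output.put_line('dest length = ' || dbms_lob.getlength(l_dest));
  dbms_lob.freetemporary(l_dest);
END;
/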

Reproduced in: https://blog.51cto.com/snans/1353672

In the next two years, how will data analysts outpace highly educated engineers?

Here is a summary of the essential skills for data analysis; I hope it helps you.
1. The three musketeers of data analysis
The Pandas data-handling workflow: data loading, cleaning, storage, transformation, merging, and reshaping.

2. MySQL
Multi-platform installation and deployment; MySQL visualization tools; data import and export; multi-table relationship design and field constraints; implementing a sales task distribution system in SQL.

3. Visualization and web tooling
Building a web project with Django; how the browser communicates with the web server; routing, views, templates, and model associations; creating various charts with Seaborn; Tableau worksheets, dashboards, and stories.

4. Quantitative analysis and data collection
Mathematical modeling of the optimal shape and size of cans; evaluation criteria for algorithmic models; a study of strategic financial quality factors of "one price".

5. Hadoop in depth
MapReduce programming with Python; chained (cascading) MapReduce principles; Combiner analysis; Spark SQL; distributed SQL and SQL query engines.
6. Selected video tutorials and learning documents


This learning content is only for study and exchange; trolls and advertisers, please stay away!

Click the link and leave your contact information for a quick, free consultation: https://t.csdnimg.cn/StoO

↓↓↓ Key points ↓↓↓
What today's society needs is not just coders who can write code, but well-rounded people who understand both technology and the business, and who can solve real business problems through data analysis and code optimization!
Whether you work in R&D, system architecture, product, operations, or even management, data analysis is a fundamental skill. It is no exaggeration to say that data-analysis ability can sustain your technical career for at least the next 10 years.
The author has done an exploratory analysis of the relevant information online: a monthly salary of 50K is only mid-range. If you want to switch careers or move into the data industry but don't know where to start, I recommend studying with CSDN; it is an easier way into the industry with broad employment prospects.
Why do I recommend this course to you?
In any enterprise, every step of the operation produces corresponding data. When a problem appears, correct and complete data analysis helps decision makers make wise and favorable decisions. Data analysis plays a vital role in an enterprise.
Data analysis is therefore like the enterprise's doctor, vital to its survival and development.
Based on this idea, I recommend CSDN's own course, the "Data Analysis Training Camp", which covers a wide range of content and integrates data collection, cleaning, organizing, visualization, and modeling to help you build the underlying logical thinking of data analysis!

This course is built around the analysis and mining methods frequently used with Python 3, teaching you to find problems, form solutions, take action, and collect feedback and evaluation through data analysis, closing the loop so that your data delivers its full business value!

 
How is the course planned?
Scientific and systematic curriculum design
It covers all the technologies used in data-analysis engineering jobs on the market and focuses on the algorithm interview questions that are asked most often.
Realistic enterprise projects
Spanning real business scenarios such as finance, advertising, e-commerce, competitions and academic experiments; on top of the existing courses, 11 enterprise-level hands-on projects on hot technologies have been added.
# Accompany you with a very conscientious teaching service #
Online learning, one-on-one Q&A, real project practice, regular testing, head teacher supervision, live Q&A, homework correction, all of these are designed to ensure that you can follow, finish and learn in the 12 weeks.
In addition, CSDN will also invite some industry celebrities to hold closed-door sharing meetings for students from time to time. Maybe just a little experience sharing in job hunting and daily work can help you avoid many detours.
Lectures by engineers from front-line big companies
Team leads from front-line companies such as BAT, Didi and NetEase serve as teaching tutors, giving in-depth explanations based on the requirements of data-mining positions in the market, so you learn the hiring preferences of well-known companies first hand.
Senior headhunters recommend and guide employment
One-on-one employment plan + resume polishing + mock interviews + interview reviews + psychological counseling + free guidance on work problems through the probation period!
Bonus: 50+ intensive interview topics & 200+ practical interview questions for training
In the process of job hunting, CSDN is like your “coach”, providing targeted assistance services in every link of your job hunting.
One more thing: the latest Talent Training Program has only 100 places, first come, first served, and at a super low price! (It is said that only a dozen or so spots are left.)
If you have more questions, for example about the price, whether the course suits you, or the detailed syllabus, you can scan the code and ask.
Note: if you want to test whether you are suited to this industry, the instructor will send you a trial class, introductory materials, a learning map, and high-frequency interview questions based on your background; these materials are enough to help you judge whether you can work in the relevant position!

Click the link and leave your contact information for a quick, free consultation: https://t.csdnimg.cn/StoO

Common problems of Hadoop startup error reporting

After deploying Hadoop and YARN on a local virtual machine, I executed the startup command ./sbin/start-dfs.sh and ran into various errors. Here I document two common problems.
1. Could not resolve hostname: Name or service not known
Error message:

19/05/17 21:31:18 WARN hdfs.DFSUtil: Namenode for null remains unresolved for ID null.  Check your hdfs-site.xml file to ensure namenodes are configured properly.
Starting namenodes on [jing-hadoop]
jing-hadoop: ssh: Could not resolve hostname jing-hadoop: Name or service not known
......

This is because the node name jing-hadoop used in the configuration file has not been added to the hosts mapping, so the hostname cannot be resolved.
Solutions:

vim /etc/hosts
127.0.0.1  jing-hadoop

And then you start it up again.
Note: jing-hadoop is the node name configured in hdfs-site.xml, and the corresponding node IP here is 127.0.0.1. Modify these values according to your own environment; do not copy them directly.
2. Unable to load the native Hadoop library
When executing start-dfs.sh, the following error also appeared:

19/05/17 21:39:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
......

If you do not see the NameNode process after executing jps, then it definitely will not work.
The warning itself appears because Hadoop's native library path is not configured in the environment variables.

vim /etc/profile

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

source /etc/profile

Then run start-dfs.sh again. The warning is still reported, but after executing jps the NameNode and DataNode processes have started normally, so it does not affect use.

[root@localhost hadoop-2.4.1]# jps
3854 NameNode
4211 Jps
3967 DataNode
4110 SecondaryNameNode

The DataNode hosts and their IPs are configured in the slaves file under $HADOOP_HOME/etc/hadoop; if multiple hosts are configured there, multiple DataNode processes will be started.
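
For reference, a minimal sketch of that slaves file, with one DataNode hostname (or IP) per line; the hostname simply reuses the mapping above:

$HADOOP_HOME/etc/hadoop/slaves:
jing-hadoop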

code is 143 Container exited with a non-zero exit code 143

Searching for this error message shows that it can also be caused by the code logic itself:
http://stackoverflow.com/questions/15281307/the-reduce-fails-due-to-task-attempt-failed-to-report-status-for-600-seconds (http://stackoverflow.com/questions/15281307/the-reduce-fails-due-to-task-attempt-failed-to-report-status-for-600-seconds-ki)
If you are sure the code logic is not the issue, try increasing
mapred.task.timeout
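
For example, a sketch of raising it in mapred-site.xml (the default is 600000 ms, i.e. the 600 seconds mentioned in the link above; the new value is illustrative):

<property>
  <name>mapred.task.timeout</name>
  <value>1200000</value>
</property>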
 
ref:
1. https://mapr.com/community/s/question/0D50L00006BIu2GSAT/code-is-143-container-exited-with-a-nonzero-exit-code-143

Socket Error 104 bug

Description of the bug
Technology stack
nginx + uwsgi + bottle
Error details
Alarm robots often have the following warnings:

<27>1 2018-xx-xxT06:59:03.038Z 660ece0ebaad admin/admin 14 - - Socket Error: 104
<31>1 2018-xx-xxT06:59:03.038Z 660ece0ebaad admin/admin 14 - - Removing timeout for next heartbeat interval
<28>1 2018-xx-xxT06:59:03.039Z 660ece0ebaad admin/admin 14 - - Socket closed when connection was open
<31>1 2018-xx-xxT06:59:03.039Z 660ece0ebaad admin/admin 14 - - Added: {'callback': <bound method SelectConnection._on_connection_start of <pika.adapters.select_connection.SelectConnection object at 0x7f74752525d0>>, 'only': None, 'one_shot': True, 'arguments': None, 'calls': 1}
<28>1 2018-xx-xxT06:59:03.039Z 660ece0ebaad admin/admin 14 - - Disconnected from RabbitMQ at xx_host:5672 (0): Not specified
<31>1 2018-xx-xxT06:59:03.039Z 660ece0ebaad admin/admin 14 - - Processing 0:_on_connection_closed
<31>1 2018-xx-xxT06:59:03.040Z 660ece0ebaad admin/admin 14 - - Calling <bound method _CallbackResult.set_value_once of <pika.adapters.blocking_connection._CallbackResult object at 0x7f74752513f8>> for "0:_on_connection_closed"

The debug process
Determine the error location
With a log it is easy: first, find where the log line is emitted.
Our own code
No.
uwsgi code
root@660ece0ebaad:/# uwsgi --version
2.0.14
Checked the corresponding source from GitHub: not there either.
Python Library code
Execute in the container

>>> import sys
>>> sys.path
['', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages/PILcompat', '/usr/lib/python2.7/dist-packages/gtk-2.0']

Grepping under these directories, the string is found in pika:

root@660ece0ebaad:/usr/local/lib/python2.7# grep "Socket Error" -R .
Binary file ./dist-packages/pika/adapters/base_connection.pyc matches
./dist-packages/pika/adapters/base_connection.py:            LOGGER.error("Fatal Socket Error: %r", error_value)
./dist-packages/pika/adapters/base_connection.py:            LOGGER.error("Socket Error: %s", error_code)

Determine the pika version:

>>> import pika
>>> pika.__version__
'0.10.0'

Determining what the error means
From the code we can see that the number in "Socket Error" is an errno error code, and error 104 means that the peer sent an RST (connection reset).

>>> import errno
>>> errno.errorcode[104]
'ECONNRESET'

First suspicion: a wrong RabbitMQ server address, since an unlistened port returns RST; verification showed that was not the case.
Next suspicion: the connection was broken by a timeout without notifying the client. Looking at the RabbitMQ server logs, there are a large number of entries like:

=ERROR REPORT==== 7-Dec-2018::20:43:18 ===
closing AMQP connection <0.9753.18> (172.17.0.19:27542 -> 192.168.44.112:5672):
missed heartbeats from client, timeout: 60s
--
=ERROR REPORT==== 7-Dec-2018::20:43:18 ===
closing AMQP connection <0.9768.18> (172.17.0.19:27544 -> 192.168.44.112:5672):
missed heartbeats from client, timeout: 60s

It turns out that all connections between the RabbitMQ server and the admin Docker container have been closed:

root@xxxxxxx:/home/dingxinglong# netstat -nap | grep 5672  | grep "172.17.0.19"

So why does the RabbitMQ server kick out pika's connections? Look at the pika code comment:

    :param int heartbeat_interval: How often to send heartbeats.
                              Min between this value and server's proposal
                              will be used. Use 0 to deactivate heartbeats
                              and None to accept server's proposal.

We did not pass in a heartbeat interval, so in theory the server default of 60 s should be used; in practice, the client never sent a heartbeat packet.
Printing confirms that a HeartbeatChecker object and its timer are created successfully, but the timer callback is never invoked.
Following the code, we use blocking_connection, and its add_timeout comment says:

def add_timeout(self, deadline, callback_method):
    """Create a single-shot timer to fire after deadline seconds. Do not
    confuse with Tornado's timeout where you pass in the time you want to
    have your callback called. Only pass in the seconds until it's to be
    called.

    NOTE: the timer callbacks are dispatched only in the scope of
    specially-designated methods: see
    `BlockingConnection.process_data_events` and
    `BlockingChannel.start_consuming`.

    :param float deadline: The number of seconds to wait to call callback
    :param callable callback_method: The callback method with the signature
        callback_method()

The timer is only fired from process_data_events, and we never call it, so the client heartbeat is never sent. Simply turning off heartbeats solves the problem.
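If you prefer to keep heartbeats instead of disabling them, a minimal sketch of the alternative (assuming pika 0.10.x with a BlockingConnection; the host name is a placeholder) is to service I/O regularly from the main loop so the heartbeat timer can fire:

import time

import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='rabbitmq-host', heartbeat_interval=30))
channel = connection.channel()

for _ in range(60):
    # ... publish or do other application work here ...
    # Timer callbacks (including the heartbeat check) are only dispatched inside
    # process_data_events(), so call it regularly from the main loop.
    connection.process_data_events()
    time.sleep(1)

connection.close()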
Specific trigger point

Following the code path of the basic_publish interface:
the RST is received while sending, and the socket error is finally logged in base_connection.py:452, in the _handle_error function.

import json

import pika

def connect_mq():
    mq_conf = xxxxx
    # heartbeat_interval=0 disables heartbeats, so the server no longer closes
    # the connection for missed client heartbeats.
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(mq_conf['host'],
                                  int(mq_conf['port']),
                                  mq_conf['path'],
                                  pika.PlainCredentials(mq_conf['user'],
                                                        mq_conf['pwd']),
                                  heartbeat_interval=0))
    channel = connection.channel()
    channel.exchange_declare(exchange=xxxxx, type='direct', durable=True)
    return channel

channel = connect_mq()

def notify_xxxxx():
    global channel

    def _publish(product):
        channel.basic_publish(exchange=xxxxx,
                              routing_key='xxxxx',
                              body=json.dumps({'msg': 'xxxxx'}))

Starting a Flume agent and solving the error "A fatal error occurred while running"

After installing Flume, the following error occurred when starting the agent with console logging:

2020-11-13 18:10:23,564 ERROR [main] node.Application: A fatal error occurred while running. Exception follows.
java.lang.NullPointerException
	at java.io.File.<init>(File.java:277)
	at org.apache.flume.node.Application.main(Application.java:299)

I was confused at the time. I had followed the steps in the book's tutorial, including the configuration and the commands, and double-checked that nothing was wrong:

a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(Its path is /usr/local/flume/conf/avro.conf; keep this configuration file path in mind, it is needed in the solution.)
The command I used before is

/usr/local/flume/bin/flume-ng agent -c . -f /usr/local/flume/conf/avro.conf -n a1 -Dflume.root.logger=INFO,console

When the error occurred, I searched the web for a solution and found one on Stack Overflow: change the command to

flume-ng agent --conf /usr/local/flume/conf --conf-file /usr/local/flume/conf/avro.conf --name a1 -Dflume.root.logger=INFO,console

Running this command starts the agent successfully and resolves the problem; the key difference is that the configuration directory is passed as an absolute path with --conf /usr/local/flume/conf instead of the relative "-c .".

Error in brew install: curl: (22) the requested URL returned error: 404 Not Found

When installing ZooKeeper today, I got the following error:

brew install zookeeper

Updating Homebrew...
==> Downloading https://mirrors.aliyun.com/homebrew/homebrew-bottles/bottles/zookeeper-3.4.13.mojave.bottle.tar.gz
######################################################################## 100.0
curl: (22) The requested URL returned error: 416
Error: Failed to download resource "zookeeper"
curl: (22) The requested URL returned error: 404 Not Found
Trying a mirror...
==> Downloading https://www-eu.apache.org/dist/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz

curl: (22) The requested URL returned error: 404 Not Found
Error: An exception occurred within a child process:
  DownloadError: Failed to download resource "zookeeper"
Download failed: https://www-us.apache.org/dist/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz

When I saw the error, I checked ZooKeeper's official website.

I found that the official site had been updated to version 3.4.14, while brew was still trying to download 3.4.13, which caused the problem above. Searching online suggested working around it by installing directly from the official download URL:

brew install https://www-us.apache.org/dist/zookeeper/zookeeper-3.4.14/zookeeper-3.4.14.tar.gz

It turned out that this method does not work.
You need to update Git first:

brew install git

then

brew install zookeeper

done