Show Hive Query Log in Workbench
Improve Workbench usability
One of the most accessed databases with Discovery’s workbench tools is Hive.
People use the workbench to perform queries that take tens of minutes to even hours.
There have been many usability improvement requirements associated with this.
I would like to talk about one of them, “Show log of running query”.
Show log of running query
Users want to show the log information of the currently executing query rather than wait idly.
If you check the HiveStatement javadocs, there are related methods.
https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hive/jdbc/HiveStatement.html#getQueryLog()
(Metatron-Discovery is currently using the 1.x version of hive-jdbc as a backwards compatible issue.)
One setting is required to get the SQL Execution Log using the getQueryLog () method.
Since Hive 2.1.0, Hive uses the asynchronous logger of Log4j2 by default.
To get the log using the getQueryLog() method, you need to turn this feature off.
hive.async.log.enabled should be set to false. The default value is true.
<property> <name>hive.async.log.enabled</name> <value>true</value> </property>
** These options are related to Hive’s performance, so please make a careful decision.
And the hive.server2.logging.operation.enabled setting must be true, but its setting is true by default.
(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.server2.logging.operation.enabled)
Now that we’ve set up Hive and checked the methods we need to use, let’s add those features.
Code implementation
Major implementation details are as follows.
1. After creating a thread for Log Print, print the log list every 1 second from statement.
2. When the query execution is completed, the log print thread is interrupted and the log is fetched until there is no remaining log.
String connUrl = "jdbc:hive2://localhost:10009"; String username = "hive"; String password = "hive"; String query = "select city, sum(sales) from default.sales group by city order by city limit 10"; DataSource ds = new DriverManagerDataSource(connUrl, username, password); try{ Connection connection = ds.getConnection(); Statement stmt = connection.createStatement(); //Start query log thread Thread queryLogThread = new Thread(() -> { try{ Thread.sleep(1000L); while(!Thread.currentThread().isInterrupted()){ Thread.sleep(50L); //print query log List<String> logLists = ((HiveStatement) stmt).getQueryLog(); for(String log : logLists){ System.out.println(log); } } } catch(InterruptedException e){ System.out.println("Print Last Query Log"); List<String> logLists = null; do { try { logLists = ((HiveStatement) stmt).getQueryLog(); } catch (SQLException sqle) { return; } for(String log : logLists){ System.out.println(log); } } while (logLists.size() > 0); } catch(SQLException e){ e.printStackTrace(); } finally { System.out.println("Query Log thread is dead"); } }); queryLogThread.start(); //Sql Execute boolean hasResultSet = stmt.execute(query); //interrupt log thread System.out.println("Query log thread interrupt."); queryLogThread.interrupt(); if(hasResultSet){ ResultSet resultSet = stmt.getResultSet(); while(resultSet.next()){ System.out.println("Query Result : " + resultSet.getObject(1) + ", " + resultSet.getObject(2)); } } }catch(SQLException e){ e.printStackTrace(); }
Properties related to Log Print Level
If you run the sample code, you will see that the Log is printed as shown below.
INFO : Compiling command(queryId=hive_20181107041359_169d4758-e488-4b07-b2a6-7e53e2cf042d): select city, sum(sales) from default.sales group by city order by city limit 10 INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:city, type:string, comment:null), FieldSchema(name:_c1, type:double, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20181107041359_169d4758-e488-4b07-b2a6-7e53e2cf042d); Time taken: 0.159 seconds INFO : Concurrency mode is disabled, not creating a lock manager INFO : Executing command(queryId=hive_20181107041359_169d4758-e488-4b07-b2a6-7e53e2cf042d): select city, sum(sales) from default.sales group by city order by city limit 10 INFO : Query ID = hive_20181107041359_169d4758-e488-4b07-b2a6-7e53e2cf042d INFO : Total jobs = 1 INFO : Launching Job 1 out of 1 INFO : Starting task [Stage-1:MAPRED] in parallel INFO : Tez session hasn't been created yet. Opening session INFO : Dag name: select city, sum(sales) from default.sa...10(Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1540788884137_5535) INFO : Completed executing command(queryId=hive_20181107041359_169d4758-e488-4b07-b2a6-7e53e2cf042d); Time taken: 14.56 seconds INFO : OK Query log thread interrupt. Print Last Query Log Query Result : Aberdeen, 26.0 Query Result : Abilene, 1.0 Query Result : Akron, 2729.0 Query Result : Albuquerque, 2219.0 Query Result : Alexandria, 5519.0 Query Result : Allen, 290.0 Query Result : Allentown, 854.0 Query Result : Altoona, 20.0 Query Result : Amarillo, 3773.0 Query Result : Anaheim, 7986.0 Query Log thread is dead Test finished.
However, due to “group by” query, Map/Reducer count log should be printed but not shown in the log.
Log Level
Let’s change the log level by using the Hive property.
<property> <name>hive.server2.logging.operation.level</name> <value>EXECUTION</value> <description> Expects one of [none, execution, performance, verbose]. HS2 operation logging mode available to clients to be set at session level. For this to work, hive.server2.logging.operation.enabled should be set to true. NONE: Ignore any logging EXECUTION: Log completion of tasks PERFORMANCE: Execution + Performance logs VERBOSE: All logs </description> </property>
(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.server2.logging.operation.level)
There are 3 levels in total. The default is EXECUTION Level.
– NONE: Ignore any logging.
– EXECUTION: Log completion of tasks.
– PERFORMANCE: Execution + Performance logs.
– VERBOSE: All logs.
The above logs are “EXECUTION” Level logs.
You can view each level’s log and select the appropriate level.
One more property
There is one more property to introduce.
<property> <name>hive.server2.in.place.progress</name> <value>true</value> <description>Allows hive server 2 to send progress bar update information. This is currently available only if the execution engine is tez.</description> </property>
(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.server2.in.place.progress)
This is the property to send progress bar update information when the excution engine is tez.
The default value is true. If this property is set to false, progress update information is printed to the log.
If hive.server2.logging.operation.level is set to “EXECUTION”, which is the default, hive.server2.in.place.progress can be set to false to display additional progress information in the log.
– hive.server2.logging.operation.level=EXECUTION, hive.server2.in.place.progress=false
INFO : Compiling command(queryId=hive_20181107043949_e2a82e77-86cf-4afa-92af-f8617a6f610b): select city, sum(sales) from default.sales group by city order by city limit 10 INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:city, type:string, comment:null), FieldSchema(name:_c1, type:double, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20181107043949_e2a82e77-86cf-4afa-92af-f8617a6f610b); Time taken: 0.159 seconds INFO : Concurrency mode is disabled, not creating a lock manager INFO : Executing command(queryId=hive_20181107043949_e2a82e77-86cf-4afa-92af-f8617a6f610b): select city, sum(sales) from default.sales group by city order by city limit 10 INFO : Query ID = hive_20181107043949_e2a82e77-86cf-4afa-92af-f8617a6f610b INFO : Total jobs = 1 INFO : Launching Job 1 out of 1 INFO : Starting task [Stage-1:MAPRED] in parallel INFO : Tez session hasn't been created yet. Opening session INFO : Dag name: select city, sum(sales) from default.sa...10(Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1540788884137_5549) INFO : Map 1: 0/1 Reducer 2: 0/2 Reducer 3: 0/1 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 Reducer 3: 0/1 INFO : Map 1: 1/1 Reducer 2: 0(+1)/2 Reducer 3: 0/1 INFO : Map 1: 1/1 Reducer 2: 1(+1)/2 Reducer 3: 0/1 INFO : Map 1: 1/1 Reducer 2: 2/2 Reducer 3: 0(+1)/1 INFO : Map 1: 1/1 Reducer 2: 2/2 Reducer 3: 1/1 INFO : Completed executing command(queryId=hive_20181107043949_e2a82e77-86cf-4afa-92af-f8617a6f610b); Time taken: 14.878 seconds INFO : OK Query log thread interrupt. Print Last Query Log Query Result : Aberdeen, 26.0 Query Result : Abilene, 1.0 Query Result : Akron, 2729.0 Query Result : Albuquerque, 2219.0 Query Result : Alexandria, 5519.0 Query Result : Allen, 290.0 Query Result : Allentown, 854.0 Query Result : Altoona, 20.0 Query Result : Amarillo, 3773.0 Query Result : Anaheim, 7986.0 Query Log thread is dead Test finished.
Now that you know how to print a log when executing hive query.