Show Hive Query Log in Workbench

Improve Workbench usability

One of the most accessed databases with Discovery’s workbench tools is Hive.
People use the workbench to perform queries that take tens of minutes to even hours.
There have been many usability improvement requirements associated with this.
I would like to talk about one of them, “Show log of running query”.

Show log of running query

Users want to show the log information of the currently executing query rather than wait idly.

If you check the HiveStatement javadocs, there are related methods.
https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hive/jdbc/HiveStatement.html#getQueryLog()
(Metatron-Discovery is currently using the 1.x version of hive-jdbc as a backwards compatible issue.)

One setting is required to get the SQL Execution Log using the getQueryLog () method.

Since Hive 2.1.0, Hive uses the asynchronous logger of Log4j2 by default.
To get the log using the getQueryLog() method, you need to turn this feature off.
hive.async.log.enabled should be set to false. The default value is true.

(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.async.log.enabled)

<property>
  <name>hive.async.log.enabled</name>
  <value>true</value>
</property>

** These options are related to Hive’s performance, so please make a careful decision.

And the hive.server2.logging.operation.enabled setting must be true, but its setting is true by default.
(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.server2.logging.operation.enabled)

Now that we’ve set up Hive and checked the methods we need to use, let’s add those features.

Code implementation

Major implementation details are as follows.
1. After creating a thread for Log Print, print the log list every 1 second from statement.
2. When the query execution is completed, the log print thread is interrupted and the log is fetched until there is no remaining log.

String connUrl = "jdbc:hive2://localhost:10009";
String username = "hive";
String password = "hive";
String query = "select city, sum(sales) from default.sales group by city order by city limit 10";
DataSource ds = new DriverManagerDataSource(connUrl, username, password);

try{
  Connection connection = ds.getConnection();
  Statement stmt = connection.createStatement();

  //Start query log thread
  Thread queryLogThread = new Thread(() -> {
    try{
      Thread.sleep(1000L);
      while(!Thread.currentThread().isInterrupted()){
        Thread.sleep(50L);

        //print query log
        List<String> logLists = ((HiveStatement) stmt).getQueryLog();
        for(String log : logLists){
          System.out.println(log);
        }
      }
    } catch(InterruptedException e){
      System.out.println("Print Last Query Log");
      List<String> logLists = null;
      do {
        try {
          logLists = ((HiveStatement) stmt).getQueryLog();
        } catch (SQLException sqle) {
          return;
        }
        for(String log : logLists){
          System.out.println(log);
        }
      } while (logLists.size() > 0);
    } catch(SQLException e){
      e.printStackTrace();
    }
    finally {
      System.out.println("Query Log thread is dead");
    }
  });
  queryLogThread.start();

  //Sql Execute
  boolean hasResultSet = stmt.execute(query);

  //interrupt log thread
  System.out.println("Query log thread interrupt.");
  queryLogThread.interrupt();

  if(hasResultSet){
    ResultSet resultSet = stmt.getResultSet();
    while(resultSet.next()){
      System.out.println("Query Result : " + resultSet.getObject(1) + ", " + resultSet.getObject(2));
    }
  }
}catch(SQLException e){
  e.printStackTrace();
}

Properties related to Log Print Level

If you run the sample code, you will see that the Log is printed as shown below.

INFO  : Compiling command(queryId=hive_20181107041359_169d4758-e488-4b07-b2a6-7e53e2cf042d): select city, sum(sales) from default.sales group by city order by city limit 10
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:city, type:string, comment:null), FieldSchema(name:_c1, type:double, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20181107041359_169d4758-e488-4b07-b2a6-7e53e2cf042d); Time taken: 0.159 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20181107041359_169d4758-e488-4b07-b2a6-7e53e2cf042d): select city, sum(sales) from default.sales group by city order by city limit 10
INFO  : Query ID = hive_20181107041359_169d4758-e488-4b07-b2a6-7e53e2cf042d
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in parallel
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: select city, sum(sales) from default.sa...10(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1540788884137_5535)

INFO  : Completed executing command(queryId=hive_20181107041359_169d4758-e488-4b07-b2a6-7e53e2cf042d); Time taken: 14.56 seconds
INFO  : OK
Query log thread interrupt.
Print Last Query Log
Query Result : Aberdeen, 26.0
Query Result : Abilene, 1.0
Query Result : Akron, 2729.0
Query Result : Albuquerque, 2219.0
Query Result : Alexandria, 5519.0
Query Result : Allen, 290.0
Query Result : Allentown, 854.0
Query Result : Altoona, 20.0
Query Result : Amarillo, 3773.0
Query Result : Anaheim, 7986.0
Query Log thread is dead
Test finished.

However, due to “group by” query, Map/Reducer count log should be printed but not shown in the log.

Log Level

Let’s change the log level by using the Hive property.

<property>
  <name>hive.server2.logging.operation.level</name>
  <value>EXECUTION</value>
  <description>
    Expects one of [none, execution, performance, verbose].
    HS2 operation logging mode available to clients to be set at session level.
    For this to work, hive.server2.logging.operation.enabled should be set to true.
      NONE: Ignore any logging
      EXECUTION: Log completion of tasks
      PERFORMANCE: Execution + Performance logs
      VERBOSE: All logs
  </description>
</property>

(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.server2.logging.operation.level)
There are 3 levels in total. The default is EXECUTION Level.
– NONE: Ignore any logging.
– EXECUTION: Log completion of tasks.
– PERFORMANCE: Execution + Performance logs.
– VERBOSE: All logs.

The above logs are “EXECUTION” Level logs.
You can view each level’s log and select the appropriate level.

One more property

There is one more property to introduce.

<property>
  <name>hive.server2.in.place.progress</name>
  <value>true</value>
  <description>Allows hive server 2 to send progress bar update information. This is currently available only if the execution engine is tez.</description>
</property>

(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.server2.in.place.progress)
This is the property to send progress bar update information when the excution engine is tez.
The default value is true. If this property is set to false, progress update information is printed to the log.

If hive.server2.logging.operation.level is set to “EXECUTION”, which is the default, hive.server2.in.place.progress can be set to false to display additional progress information in the log.

– hive.server2.logging.operation.level=EXECUTION, hive.server2.in.place.progress=false

INFO  : Compiling command(queryId=hive_20181107043949_e2a82e77-86cf-4afa-92af-f8617a6f610b): select city, sum(sales) from default.sales group by city order by city limit 10
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:city, type:string, comment:null), FieldSchema(name:_c1, type:double, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20181107043949_e2a82e77-86cf-4afa-92af-f8617a6f610b); Time taken: 0.159 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20181107043949_e2a82e77-86cf-4afa-92af-f8617a6f610b): select city, sum(sales) from default.sales group by city order by city limit 10
INFO  : Query ID = hive_20181107043949_e2a82e77-86cf-4afa-92af-f8617a6f610b
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in parallel
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: select city, sum(sales) from default.sa...10(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1540788884137_5549)

INFO  : Map 1: 0/1	Reducer 2: 0/2	Reducer 3: 0/1	
INFO  : Map 1: 0(+1)/1	Reducer 2: 0/2	Reducer 3: 0/1	
INFO  : Map 1: 1/1	Reducer 2: 0(+1)/2	Reducer 3: 0/1	
INFO  : Map 1: 1/1	Reducer 2: 1(+1)/2	Reducer 3: 0/1	
INFO  : Map 1: 1/1	Reducer 2: 2/2	Reducer 3: 0(+1)/1	
INFO  : Map 1: 1/1	Reducer 2: 2/2	Reducer 3: 1/1	
INFO  : Completed executing command(queryId=hive_20181107043949_e2a82e77-86cf-4afa-92af-f8617a6f610b); Time taken: 14.878 seconds
INFO  : OK
Query log thread interrupt.
Print Last Query Log
Query Result : Aberdeen, 26.0
Query Result : Abilene, 1.0
Query Result : Akron, 2729.0
Query Result : Albuquerque, 2219.0
Query Result : Alexandria, 5519.0
Query Result : Allen, 290.0
Query Result : Allentown, 854.0
Query Result : Altoona, 20.0
Query Result : Amarillo, 3773.0
Query Result : Anaheim, 7986.0
Query Log thread is dead
Test finished.

Now that you know how to print a log when executing hive query.

Show Hive Query Log in Workbench