Tensorflow: JRE Fatal error (SIGILL (0x4)) at loading _clustering_ops.so


I created a test Java application that loads a trained Python model through TensorFlow.

I had to add the line below to fix the exception "Op type not registered 'NearestNeighbors' in binary":

TensorFlow.loadLibrary("/tmp/path/to/_clustering_ops.so");

My application runs with no issue on my computer.

However, when running the application on a server, the application crashes with the following details.

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007f40a00d923a, pid=1412, tid=0x00007f405a9e7700
#
# JRE version: OpenJDK Runtime Environment (8.0_171-b11) (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
# Java VM: OpenJDK 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [clustering_ops.so+0x823a]  Eigen::PlainObjectBase<Eigen::Matrix<float, -1, 1, 0, -1, 1> >::PlainObjectBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>,
Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<float>,
Eigen::Matrix<float, -1, 1, 0, -1, 1> const> const,
Eigen::PartialReduxExpr<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1> const, 0, Eigen::Stride<0, 0> > const,
Eigen::internal::member_squaredNorm<float>, 1> const> >    (Eigen::DenseBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>, 
Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<float>,
Eigen::Matrix<float, -1, 1, 0, -1, 1> const> const,
Eigen::PartialReduxExpr<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1> const, 0, Eigen::Stride<0, 0> > const,
Eigen::internal::member_squaredNorm<float>, 1> const> > const&)+0x6a

Debugging:

(gdb) disassemble
Dump of assembler code for function __GI_raise:
   0x00007f8bad12f3f0 <+0>: mov    %fs:0x2d4,%ecx
   0x00007f8bad12f3f8 <+8>: mov    %fs:0x2d0,%eax
   0x00007f8bad12f400 <+16>:    movslq %eax,%rsi
   0x00007f8bad12f403 <+19>:    test   %esi,%esi
   0x00007f8bad12f405 <+21>:    jne    0x7f8bad12f438 <__GI_raise+72>
   0x00007f8bad12f407 <+23>:    mov    $0xba,%eax
   0x00007f8bad12f40c <+28>:    syscall 
   0x00007f8bad12f40e <+30>:    mov    %eax,%ecx
   0x00007f8bad12f410 <+32>:    mov    %eax,%fs:0x2d0
   0x00007f8bad12f418 <+40>:    movslq %eax,%rsi
   0x00007f8bad12f41b <+43>:    movslq %edi,%rdx
   0x00007f8bad12f41e <+46>:    mov    $0xea,%eax
   0x00007f8bad12f423 <+51>:    movslq %ecx,%rdi
   0x00007f8bad12f426 <+54>:    syscall 
=> 0x00007f8bad12f428 <+56>:    cmp    $0xfffffffffffff000,%rax
   0x00007f8bad12f42e <+62>:    ja     0x7f8bad12f450 <__GI_raise+96>
   0x00007f8bad12f430 <+64>:    repz retq 
   0x00007f8bad12f432 <+66>:    nopw   0x0(%rax,%rax,1)
   0x00007f8bad12f438 <+72>:    test   %ecx,%ecx
   0x00007f8bad12f43a <+74>:    jg     0x7f8bad12f41b <__GI_raise+43>
   0x00007f8bad12f43c <+76>:    mov    %ecx,%edx
   0x00007f8bad12f43e <+78>:    neg    %edx
   0x00007f8bad12f440 <+80>:    and    $0x7fffffff,%ecx
   0x00007f8bad12f446 <+86>:    cmove  %esi,%edx
   0x00007f8bad12f449 <+89>:    mov    %edx,%ecx
   0x00007f8bad12f44b <+91>:    jmp    0x7f8bad12f41b <__GI_raise+43>
   0x00007f8bad12f44d <+93>:    nopl   (%rax)
   0x00007f8bad12f450 <+96>:    mov    0x38ea21(%rip),%rdx        # 0x7f8bad4bde78
   0x00007f8bad12f457 <+103>:   neg    %eax
   0x00007f8bad12f459 <+105>:   mov    %eax,%fs:(%rdx)
   0x00007f8bad12f45c <+108>:   mov    $0xffffffff,%eax
   0x00007f8bad12f461 <+113>:   retq   
End of assembler dump.


(gdb) bt
#0  0x00007f8bad12f428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f8bad13102a in __GI_abort () at abort.c:89
#2  0x00007f8bac432c59 in ?? () from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#3  0x00007f8bac5e8047 in ?? () from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#4  0x00007f8bac43c6ef in JVM_handle_linux_signal () from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#5  0x00007f8bac42fd88 in ?? () from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x00007f8ba808023a in Eigen::PlainObjectBase<Eigen::Matrix<float, -1, 1, 0, -1, 1> >::PlainObjectBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>,
Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<float>,
Eigen::Matrix<float, -1, 1, 0, -1, 1> const> const,
Eigen::PartialReduxExpr<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1> const, 0, Eigen::Stride<0, 0> > const,
Eigen::internal::member_squaredNorm<float>, 1> const> >(Eigen::DenseBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>,
Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<float>,
Eigen::Matrix<float, -1, 1, 0, -1, 1> const> const,
Eigen::PartialReduxExpr<Eigen::Map<Eigen::Matrix<float, -1, -1, 1, -1, -1> const, 0, Eigen::Stride<0, 0> > const,
Eigen::internal::member_squaredNorm<float>, 1> const> > const&) ()
from /srv/path/to/clustering_ops.so
#8  0x00007f8ba8088e6e in 
tensorflow::NearestNeighborsOp::Compute(tensorflow::OpKernelContext*) ()     
from /srv/path/to/_clustering_ops.so
#9  0x00007f8b5dbf364c in ?? ()
#10 0x0000000000000000 in ?? ()

I am suspecting this is an issue with the server. However, I cannot figure out what it is. I made sure both environments were the same (my instance on the server and localhost: Ubuntu 16.04.4 LTS and javac 1.8.0_171). I also ran a RAM test on the server, and it reported no problems.

I would appreciate it if someone could point me in the right direction toward a fix.


UPDATE 1: Thank you for the reply @Employed Russian.

I hadn't built the .so file myself; I am retrieving it from the TensorFlow library files.

Following your recommendations, I thought of cloning the entire TensorFlow project from GitHub and building clustering_ops.so from the clustering_ops.cc file in 'tensorflow/contrib/factorization/ops/clustering_ops.cc'. However, I had to give up on that, at least for now, because too many import paths would need updating.

I then thought that, if this was a hardware compatibility issue, I could install TensorFlow on the server and use the clustering_ops.so file from that installation. I did this and, sure enough, I am getting a different error:

2018-07-03 14:37:47.871 ERROR 13026 --- [nio-9090-exec-1] o.a.c.c.C.[.[.[.[dispatcherServlet]      : Servlet.service() for servlet [dispatcherServlet] in context with path [/test] threw exception [Handler dispatch failed; nested exception is java.lang.UnsatisfiedLinkError: $HOME/clustering_ops.so: undefined symbol: _ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumE] with root cause

java.lang.UnsatisfiedLinkError: $HOME/clustering_ops.so: undefined symbol: _ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumE
at org.tensorflow.TensorFlow.loadLibrary(TensorFlow.java:47) ~[libtensorflow-1.5.0.jar!/:na]
at com.domain.serverTest.controller.TestController.postSomething(TestController.java:41) ~[classes!/:0.0.1-SNAPSHOT]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_171]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_171]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_171]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_171]
at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:209) ~[spring-web-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:136) ~[spring-web-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:102) ~[spring-webmvc-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:877) ~[spring-webmvc-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:783) ~[spring-webmvc-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:991) ~[spring-webmvc-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:925) ~[spring-webmvc-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:974) ~[spring-webmvc-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:877) ~[spring-webmvc-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:661) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:851) ~[spring-webmvc-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) ~[tomcat-embed-websocket-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:99) ~[spring-web-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.springframework.web.filter.HttpPutFormContentFilter.doFilterInternal(HttpPutFormContentFilter.java:109) ~[spring-web-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:93) ~[spring-web-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200) ~[spring-web-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198) ~[tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:496) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:803) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:790) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1468) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_171]
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) [tomcat-embed-core-8.5.31.jar!/:8.5.31]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171]

UPDATE 2: Downloading the TensorFlow source and compiling it with the right setting for the -march flag resolved the above error. However, another issue arose, and I would appreciate any help with it. I've been battling with it for some time now and have not found any hint as to the root cause.

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fb191313512, pid=5931, tid=0x00007fb13abe8700
#
# JRE version: OpenJDK Runtime Environment (8.0_171-b11) (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
# Java VM: OpenJDK 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libc.so.6+0x84512]  cfree+0x22
tensorflow
gdb
asked on Stack Overflow Jun 28, 2018 by Kab111 • edited Jul 12, 2018 by Kab111

1 Answer


I am suspecting this is an issue with the server. However cannot figure out what it is

The issue very likely is similar to this one.

Your development machine and your server have different processors with different instruction sets (the server being older). When you build on the development machine, the compiler (by default) generates instructions that work fine on the development machine but do not work on the server.
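To make the mismatch concrete, here is a minimal Java sketch that compares two CPU flag lists and reports what the build machine has that the target lacks. The class name FlagDiff, the method missingOnTarget, and the flag strings in main are all hypothetical; on real machines the lists would come from the "flags" line of /proc/cpuinfo.

```java
import java.util.Set;
import java.util.TreeSet;

class FlagDiff {
    /** Returns the flags present on the build machine but absent on the target, sorted. */
    static Set<String> missingOnTarget(String buildFlags, String targetFlags) {
        Set<String> missing = split(buildFlags);
        missing.removeAll(split(targetFlags));
        return missing;
    }

    /** Splits a whitespace-separated flag list into a sorted set. */
    private static Set<String> split(String flags) {
        Set<String> out = new TreeSet<>();
        for (String flag : flags.trim().split("\\s+")) {
            if (!flag.isEmpty()) {
                out.add(flag);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical flag lists, shortened for illustration only.
        String dev = "fpu sse sse2 sse4_1 sse4_2 avx avx2 fma";
        String server = "fpu sse sse2 sse4_1 sse4_2";
        System.out.println(missingOnTarget(dev, server)); // prints [avx, avx2, fma]
    }
}
```

Any flag in the result names an instruction set extension that a binary tuned for the build machine may use and the target cannot execute.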

(gdb) disassemble Dump of assembler code for function __GI_raise:

That is not the function you want to disassemble. What you want is:

(gdb) x/i 0x00007f8ba808023a

which is the instruction that generated SIGILL (the address is that of frame #7 in your backtrace). You will likely find that it is an AVX2 instruction, and that your server doesn't support AVX2.

You can see what your server supports in /proc/cpuinfo (or just Google the model number).
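Since the application is already in Java, the same check can be done from code. The following is a rough sketch, assuming a Linux host where /proc/cpuinfo contains a "flags" line; the class name CpuFlags, the method parseFlags, and the list of flags printed in main are illustrative choices, not anything TensorFlow itself requires or provides.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CpuFlags {
    /** Extracts the feature flags from the first "flags" line of /proc/cpuinfo-style text. */
    static Set<String> parseFlags(List<String> cpuinfoLines) {
        for (String line : cpuinfoLines) {
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equals("flags")) {
                String value = line.substring(colon + 1).trim();
                if (value.isEmpty()) {
                    return Collections.emptySet();
                }
                return new LinkedHashSet<>(Arrays.asList(value.split("\\s+")));
            }
        }
        return Collections.emptySet(); // no "flags" line found
    }

    public static void main(String[] args) throws IOException {
        Set<String> flags = parseFlags(Files.readAllLines(Paths.get("/proc/cpuinfo")));
        // Flag names as spelled in /proc/cpuinfo; which ones matter depends on how the .so was built.
        for (String name : new String[] {"sse4_1", "sse4_2", "avx", "avx2", "fma"}) {
            System.out.println(name + ": " + (flags.contains(name) ? "yes" : "no"));
        }
    }
}
```

Running it on both the development machine and the server and comparing the output shows which extensions only one of them supports.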

Once you've identified the instruction set your server supports, build your code with the appropriate -march=... setting, and it should work on both the development machine and the server.

answered on Stack Overflow Jun 29, 2018 by Employed Russian

User contributions licensed under CC BY-SA 3.0