/* * Copyright (c) 2004, 2007, Oracle and/or its affiliates. All rights reserved. * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. * * This code is free software; you can redistribute it and/or modify it * under the terms of the GNU General Public License version 2 only, as * published by the Free Software Foundation. * * This code is distributed in the hope that it will be useful, but WITHOUT * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License * version 2 for more details (a copy is included in the LICENSE file that * accompanied this code). * * You should have received a copy of the GNU General Public License version * 2 along with this work; if not, write to the Free Software Foundation, * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. * * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA * or visit www.oracle.com if you need additional information or have any * questions. */ /** * @test * @bug 5033550 * @summary JDWP back end uses modified UTF-8 * * @author jjh * * @run build TestScaffold VMConnection TargetListener TargetAdapter * @run compile -g UTF8Test.java * @run driver UTF8Test */ /* There is UTF-8 and there is modified UTF-8, which I will call M-UTF-8. The two differ in the representation of binary 0, and in some other more esoteric representations. See http://java.sun.com/developer/technicalArticles/Intl/Supplementary/#Modified_UTF-8 http://java.sun.com/javase/6/docs/technotes/guides/jni/spec/types.html#wp16542 All the following are observations of the treatment of binary 0. In UTF-8, this represented as one byte: 0x00 while in modified UTF-8, it is represented as two bytes 0xc0 0x80 ** I haven't investigated if the other differences between UTF-8 and M-UTF-8 are handled in the same way. Here is how these our handled in our BE, JDWP, and FE: - Strings in .class files are M-UTF-8. - To get the value of a string object from the VM, our BE calls char * utf = JNI_FUNC_PTR(env,GetStringUTFChars)(env, string, NULL); which returns M-UTF-8. - To create a string object in the VM, our BE VirtualMachine.createString() calls string = JNI_FUNC_PTR(env,NewStringUTF)(env, cstring); This function expects the string to be M-UTF-8 BUG: If the string came from JDWP, then it is actually UTF-8 - I haven't investigated strings in JVMTI. - The JDWP spec says that strings are UTF-8. The intro says this for all strings, and the createString command and the StringRefernce.value command say it explicitly. - Our FE java writes strings to JDWP as UTF-8. - BE function outStream_writeString uses strlen meaning it expects no 0 bytes, meaning that it expects M-UTF-8 This function writes the byte length and then calls outStream.c::writeBytes which just writes the bytes to JDWP as is. BUG: If such a string came from the VM via JNI, it is actually M-UTF-8 FIX: - scan string to see if contains an M-UTF-8 char. if yes, - call String(bytes, 0, len, "UTF8") to get a java string. Will this work -ie, the input is M-UTF-8 instead of real UTF-8 - call some java method (NOT JNI which would just come back with M-UTF-8) on the String to get real UTF-8 - The JDWP StringReference.value command does reads a string from the BE out of the JDWP stream and does this to createe a Java String for it (see PacketStream.readString): String readString() { String ret; int len = readInt(); try { ret = new String(pkt.data, inCursor, len, "UTF8"); } catch(java.io.UnsupportedEncodingException e) { This String ctor converts _both- the M-UTF-8 0xc0 0x80 and UTF-8 0x00 into a Java char containing 0x0000 Does it do this for the other differences too? Summary: 1. JDWP says strings are UTF-8. We interpret this to mean standard UTF-8. 2. JVMTI will be changed to match JNI saying that strings are M-UTF-8. 3. The BE gets UTF-8 strings off JDWP and must convert them to M-UTF-8 before giving it to JVMTI or JNI. 4. The BE gets M-UTF-8 strings from JNI and JVMTI and must convert them to UTF-8 when writing to JDWP. Here is how the supplementals are represented in java Strings. This from java.lang.Character doc: The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF). See utf8.txt ---- NSK Packet.java in the nsk/share/jdwp framework does this to write a string to JDWP: public void addString(String value) { final int count = JDWP.TypeSize.INT + value.length(); addInt(value.length()); try { addBytes(value.getBytes("UTF-8"), 0, value.length()); } catch (UnsupportedEncodingException e) { throw new Failure("Unsupported UTF-8 ecnoding while adding string value to JDWP packet:\n\t" + e); } } ?? Does this get the standard UTF-8? I would expect so. and the readString method does this: for (int i = 0; i < len; i++) s[i] = getByte(); try { return new String(s, "UTF-8"); } catch (UnsupportedEncodingException e) { throw new Failure("Unsupported UTF-8 ecnoding while extracting string value from JDWP packet:\n\t" + e); } Thus, this won't notice the modified UTF-8 coming in from JDWP . */ import com.sun.jdi.*; import com.sun.jdi.event.*; import com.sun.jdi.request.*; import java.io.UnsupportedEncodingException; import java.util.*; /********** target program **********/ /* * The debuggee has a few Strings the debugger reads via JDI */ class UTF8Targ { static String[] vals = new String[] {"xx\u0000yy", // standard UTF-8 0 "xx\ud800\udc00yy", // first supplementary "xx\udbff\udfffyy" // last supplementary // d800 = 1101 1000 0000 0000 dc00 = 1101 1100 0000 0000 // dbff = 1101 1011 1111 1111 dfff = 1101 1111 1111 1111 }; static String aField; public static void main(String[] args){ System.out.println("Howdy!"); gus(); System.out.println("Goodbye from UTF8Targ!"); } static void gus() { } } /********** test program **********/ public class UTF8Test extends TestScaffold { ClassType targetClass; ThreadReference mainThread; Field targetField; UTF8Test (String args[]) { super(args); } public static void main(String[] args) throws Exception { new UTF8Test(args).startTests(); } /********** test core **********/ protected void runTests() throws Exception { /* * Get to the top of main() * to determine targetClass and mainThread */ BreakpointEvent bpe = startToMain("UTF8Targ"); targetClass = (ClassType)bpe.location().declaringType(); targetField = targetClass.fieldByName("aField"); ArrayReference targetVals = (ArrayReference)targetClass.getValue(targetClass.fieldByName("vals")); /* For each string in the debuggee's 'val' array, verify that we can * read that value via JDI. */ for (int ii = 0; ii < UTF8Targ.vals.length; ii++) { StringReference val = (StringReference)targetVals.getValue(ii); String valStr = val.value(); /* * Verify that we can read a value correctly. * We read it via JDI, and access it directly from the static * var in the debuggee class. */ if (!valStr.equals(UTF8Targ.vals[ii]) || valStr.length() != UTF8Targ.vals[ii].length()) { failure(" FAILED: Expected /" + printIt(UTF8Targ.vals[ii]) + "/, but got /" + printIt(valStr) + "/, length = " + valStr.length()); } } /* Test 'all' unicode chars - send them to the debuggee via JDI * and then read them back. */ doFancyVersion(); resumeTo("UTF8Targ", "gus", "()V"); try { Thread.sleep(1000); } catch (InterruptedException ee) { } /* * resume the target listening for events */ listenUntilVMDisconnect(); /* * deal with results of test * if anything has called failure("foo") testFailed will be true */ if (!testFailed) { println("UTF8Test: passed"); } else { throw new Exception("UTF8Test: failed"); } } /** * For each unicode value, send a string containing * it to the debuggee via JDI, read it back via JDI, and see if * we get the same value. */ void doFancyVersion() throws Exception { // This does 4 chars at a time just to save time. for (int ii = Character.MIN_CODE_POINT; ii < Character.MIN_SUPPLEMENTARY_CODE_POINT; ii += 4) { // Skip the surrogates if (ii == Character.MIN_SURROGATE) { ii = Character.MAX_SURROGATE - 3; break; } doFancyTest(ii, ii + 1, ii + 2, ii + 3); } // Do the supplemental chars. for (int ii = Character.MIN_SUPPLEMENTARY_CODE_POINT; ii <= Character.MAX_CODE_POINT; ii += 2000) { // Too many of these so just do a few doFancyTest(ii, ii + 1, ii + 2, ii + 3); } } void doFancyTest(int ... args) throws Exception { String ss = new String(args, 0, 4); targetClass.setValue(targetField, vm().mirrorOf(ss)); StringReference returnedVal = (StringReference)targetClass.getValue(targetField); String returnedStr = returnedVal.value(); if (!ss.equals(returnedStr)) { failure("Set: FAILED: Expected /" + printIt(ss) + "/, but got /" + printIt(returnedStr) + "/, length = " + returnedStr.length()); } } /** * Return a String containing binary representations of * the chars in a String. */ String printIt(String arg) { char[] carray = arg.toCharArray(); StringBuffer bb = new StringBuffer(arg.length() * 5); for (int ii = 0; ii < arg.length(); ii++) { int ccc = arg.charAt(ii); bb.append(String.format("%1$04x ", ccc)); } return bb.toString(); } String printIt1(String arg) { byte[] barray = null; try { barray = arg.getBytes("UTF-8"); } catch (UnsupportedEncodingException ee) { } StringBuffer bb = new StringBuffer(barray.length * 3); for (int ii = 0; ii < barray.length; ii++) { bb.append(String.format("%1$02x ", barray[ii])); } return bb.toString(); } }