As of April 2008, Notepad2 was still internally an MBCS (multi-byte character set) or ANSI application. This was simply a result of its development history: parts of the Notepad2 source code originate from my "Textview" program and are more than 10 years old (from around 1995 to 1996). At that time, I didn't care about Unicode: I was happy when my programs ran smoothly on Windows 95 (Windows NT was just for rich professionals, anyway), and there was no such thing as Unicode in the Turbo Pascal for DOS world, where I took my first programming steps (a word to the wise: in return, I'm glad I didn't learn PHP first, so at least I have a basic idea of what a memory pointer is).
Recently, I've taken upon myself the adventure of converting Notepad2 to a native Win32 Unicode application (Windows 9x support had already been dropped some time ago, for other reasons). I had already created Win32 Unicode applications from scratch, but porting more than 20'000 lines of C code to Unicode was a special challenge indeed. All this was made even more complicated by the fact that the Scintilla editing component used by Notepad2 works with MBCS strings, both internally and in the APIs exposed to clients.
A Unicode version of Notepad2 works with the native system encoding of the Windows NT platform, so it can handle Unicode file names, and the editor window and the dialog boxes accept Unicode input. However, case-insensitive text search for non-ASCII characters is still not possible, due to a current limitation of the Scintilla editing component. I don't want to maintain my own branch of the Scintilla source code with extensions to the character matching engine to bypass this limitation; being able to use the default Scintilla distribution and update quickly seems much more valuable to me.
Following is a summary of some points worth mentioning that I came across when converting Notepad2 from an MBCS to a native Win32 Unicode application. This is not intended to be a complete, professional guide to porting applications to Unicode, but it may illustrate some details for those who are interested.
UTF-16 little-endian is the encoding standard at Microsoft (and in the Windows operating system).¹
¹ Globalization Step-by-Step - Unicode Enabled
There are two compile-time flags (preprocessor definitions): UNICODE for the Win32 API and _UNICODE for the C Runtime Library. Both need to be defined.
Replacing all variables of type char with TCHAR (or LPTSTR and LPCTSTR, as appropriate) and using the TEXT() macro both for single and double quoted strings is the best way to go. However, Notepad2 doesn't support Windows 9x any more, and I don't want to maintain an MBCS version any longer, so I decided to use WCHAR (and LPWSTR, LPCWSTR) along with L"Unicode" strings, directly.
The code that deals with Scintilla text still uses char. I favored the Win32 API string functions over the C Runtime Library, so the "A" functions have to be called explicitly (like lstrlenA() instead of strlen()).
Fortunately, string pointer arithmetics are the same for char and WCHAR. This means a macro like
#define StrEnd(pStart) ((pStart) + lstrlen(pStart))
returns a pointer to the terminating zero for both MBCS and Unicode strings.
Most Win32 API functions that copy results into a string buffer interpret the buffer size as number of bytes for the MBCS version, and number of characters for the Unicode version. Using something like
WCHAR wch[MAX_PATH]; DWORD cch = sizeof(wch) / sizeof(WCHAR);
to specify the buffer size is okay for MBCS and Unicode.
Adjustments are necessary when memory for strings is dynamically allocated. The size to be allocated is usually specified in bytes, so it has to be calculated as follows:
WCHAR *wch = (WCHAR*)LocalAlloc(LPTR, sizeof(WCHAR) * (cch + 1)); // cch characters plus the terminating zero
This affects malloc() and family, LocalAlloc(), GlobalAlloc(), but also CopyMemory(), MoveMemory() and ZeroMemory()!
Calculating the buffer size in characters to pass an allocated string to a Win32 API function works just the other way round:
DWORD cch = LocalSize(wch) / sizeof(WCHAR);
A safe and easy way to get a copy of a string is using StrDup() from the Shell Lightweight Utility Functions (SHLWAPI).
For the WinMain() function, the lpCmdLine parameter is always of type LPSTR. GetCommandLine() (or CommandLineToArgvW()) can be used to retrieve the command line as a Unicode string. However, lpCmdLine does not include the very first command-line argument (the module name), whereas the other functions do. I caught this one quickly, as the Unicode version of Notepad2 kept loading Notepad2.exe on startup...
For the ini-file encoding, the following rules apply: both the "A" and "W" versions of GetPrivateProfileString() & Co. (the set and get functions) do not change the encoding of an existing ini-file. Calling WritePrivateProfileStringW() with a non-Unicode ini-file causes conversion of the data string, and may cause loss of data. To force non-existent ini-files to be saved as Unicode, a UTF-16 little-endian byte-order mark (FF FE) can be written to an empty file before calling any of the ini-file management functions.
Despite a minimal loss of performance on very slow drives and a few extra kilobytes (most of the Notepad2 configuration data is plain ASCII, which would result in a very compact file if saved as UTF-8), I decided to use UTF-16 as the ini-file encoding, for best compatibility with the Win32 ini-file manipulation APIs.
Don't forget to check the GetClipboardData(), SetClipboardData(), etc. function calls: you may need to change CF_TEXT to CF_UNICODETEXT.
And, I noticed a difference between the "A" and "W" versions of a function: SearchPathA() does not touch the output buffer if the function fails, whereas SearchPathW() sets it to an empty string. But if your code is cleaner than mine, you won't even notice!