Characters that map differently between Shift_JIS and CP932

List of differing characters

1. — and ―

2. 〜 and ～

3. ‖ and ∥

4. − and －

5. ¢ and ￠

6. £ and ￡

7. ¬ and ￢

How to check

code:dos

ver

Microsoft Windows Version 10.0.14393

code:py

python.exe

Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) MSC v.1900 64 bit (AMD64) on win32

Type "help", "copyright", "credits" or "license" for more information.

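Not quite what I expected

Decoding 0x815C as Shift_JIS gave U+2015 (HORIZONTAL BAR), not U+2014 (EM DASH)

Decoding it as CP932 also gave U+2015 (HORIZONTAL BAR), so both results were the same

Every other byte pair decoded to the character I expected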
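To check all seven byte pairs in one pass instead of typing each one into the REPL, something like the following should work. This is a minimal sketch: `CODES` and `decode_both` are names made up here, and the output depends on the codec tables shipped with the Python build in use.

```python
# Two-byte codes taken from the list at the top of this page.
CODES = [0x815C, 0x8160, 0x8161, 0x817C, 0x8191, 0x8192, 0x81CA]


def decode_both(code):
    """Decode one two-byte code with both codecs; return the two code points."""
    raw = code.to_bytes(2, "big")
    sjis_char = raw.decode("shift_jis")
    cp932_char = raw.decode("cp932")
    return ord(sjis_char), ord(cp932_char)


for code in CODES:
    sjis_cp, cp932_cp = decode_both(code)
    marker = "same" if sjis_cp == cp932_cp else "differs"
    print(f"0x{code:04X}  shift_jis=U+{sjis_cp:04X}  cp932=U+{cp932_cp:04X}  {marker}")
```

Any row marked `same` is a byte pair where the two codecs agree, which is how 0x815C stands out from the rest.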

0x815C

code:python

>> b'\x81\x5c'.decode('shift_jis')

'―'

>> hex(ord(b'\x81\x5c'.decode('shift_jis')))

'0x2015'

>> '―'.encode('utf_8')

b'\xe2\x80\x95'

>> b'\x81\x5c'.decode('cp932')

'―'

>> hex(ord(b'\x81\x5c'.decode('cp932')))

'0x2015'

>> '―'.encode('utf_8')

b'\xe2\x80\x95'

Huh? Both return 0x2015

U+2014 cannot be encoded to Shift_JIS...

code:python

>> '—'.encode('utf_8')

b'\xe2\x80\x94'

>> '—'.encode('shift_jis')

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

UnicodeEncodeError: 'shift_jis' codec can't encode character '\u2014' in position 0: illegal multibyte sequence

>> '—'.encode('cp932')

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

UnicodeEncodeError: 'cp932' codec can't encode character '\u2014' in position 0: illegal multibyte sequence

EM DASH

— U+2014 Unicode character (0g0.org)

URL encoding (UTF-8): %E2%80%94

HORIZONTAL BAR

― U+2015 Unicode character (0g0.org)

URL encoding (UTF-8): %E2%80%95

URL encoding (EUC-JP): %A1%BD

URL encoding (SHIFT_JIS): %81%5C

0x8160

code:python

>> b'\x81\x60'.decode('shift_jis')

'〜'

>> hex(ord(b'\x81\x60'.decode('shift_jis')))

'0x301c'

>> '〜'.encode('utf_8')

b'\xe3\x80\x9c'

>> b'\x81\x60'.decode('cp932')

'～'

>> hex(ord(b'\x81\x60'.decode('cp932')))

'0xff5e'

>> '～'.encode('utf_8')

b'\xef\xbd\x9e'

WAVE DASH

〜 U+301C Unicode character (0g0.org)

URL encoding (UTF-8): %E3%80%9C

URL encoding (EUC-JP): %A1%C1

URL encoding (SHIFT_JIS): %81%60

FULLWIDTH TILDE

～ U+FF5E Unicode character (0g0.org)

URL encoding (UTF-8): %EF%BD%9E

0x8161

code:python

>> b'\x81\x61'.decode('shift_jis')

'‖'

>> hex(ord(b'\x81\x61'.decode('shift_jis')))

'0x2016'

>> '‖'.encode('utf_8')

b'\xe2\x80\x96'

>> b'\x81\x61'.decode('cp932')

'∥'

>> hex(ord(b'\x81\x61'.decode('cp932')))

'0x2225'

>> '∥'.encode('utf_8')

b'\xe2\x88\xa5'

DOUBLE VERTICAL LINE

‖ U+2016 Unicode character (0g0.org)

URL encoding (UTF-8): %E2%80%96

URL encoding (EUC-JP): %A1%C2

URL encoding (SHIFT_JIS): %81a

PARALLEL TO

∥ U+2225 Unicode character (0g0.org)

URL encoding (UTF-8): %E2%88%A5

0x817C

code:python

>> b'\x81\x7c'.decode('shift_jis')

'−'

>> hex(ord(b'\x81\x7c'.decode('shift_jis')))

'0x2212'

>> '−'.encode('utf_8')

b'\xe2\x88\x92'

>> b'\x81\x7c'.decode('cp932')

'－'

>> hex(ord(b'\x81\x7c'.decode('cp932')))

'0xff0d'

>> '－'.encode('utf_8')

b'\xef\xbc\x8d'

MINUS SIGN

− U+2212 Unicode character (0g0.org)

URL encoding (UTF-8): %E2%88%92

URL encoding (EUC-JP): %A1%DD

URL encoding (SHIFT_JIS): %81%7C

FULLWIDTH HYPHEN-MINUS

－ U+FF0D Unicode character (0g0.org)

URL encoding (UTF-8): %EF%BC%8D

0x8191

code:python

>> b'\x81\x91'.decode('shift_jis')

'¢'

>> hex(ord(b'\x81\x91'.decode('shift_jis')))

'0xa2'

>> '¢'.encode('utf_8')

b'\xc2\xa2'

>> b'\x81\x91'.decode('cp932')

'￠'

>> hex(ord(b'\x81\x91'.decode('cp932')))

'0xffe0'

>> '￠'.encode('utf_8')

b'\xef\xbf\xa0'

CENT SIGN

¢ U+00A2 Unicode character (0g0.org)

URL encoding (UTF-8): %C2%A2

URL encoding (EUC-JP): %A1%F1

URL encoding (SHIFT_JIS): %81%91

FULLWIDTH CENT SIGN

￠ U+FFE0 Unicode character (0g0.org)

URL encoding (UTF-8): %EF%BF%A0

0x8192

code:python

>> b'\x81\x92'.decode('shift_jis')

'£'

>> hex(ord(b'\x81\x92'.decode('shift_jis')))

'0xa3'

>> '£'.encode('utf_8')

b'\xc2\xa3'

>> b'\x81\x92'.decode('cp932')

'￡'

>> hex(ord(b'\x81\x92'.decode('cp932')))

'0xffe1'

>> '￡'.encode('utf_8')

b'\xef\xbf\xa1'

POUND SIGN

£ U+00A3 Unicode character (0g0.org)

URL encoding (UTF-8): %C2%A3

URL encoding (EUC-JP): %A1%F2

URL encoding (SHIFT_JIS): %81%92

FULLWIDTH POUND SIGN

￡ U+FFE1 Unicode character (0g0.org)

URL encoding (UTF-8): %EF%BF%A1

0x81CA

code:python

>> b'\x81\xca'.decode('shift_jis')

'¬'

>> hex(ord(b'\x81\xca'.decode('shift_jis')))

'0xac'

>> '¬'.encode('utf_8')

b'\xc2\xac'

>> b'\x81\xca'.decode('cp932')

'￢'

>> hex(ord(b'\x81\xca'.decode('cp932')))

'0xffe2'

>> '￢'.encode('utf_8')

b'\xef\xbf\xa2'

NOT SIGN

¬ U+00AC Unicode character (0g0.org)

URL encoding (UTF-8): %C2%AC

URL encoding (EUC-JP): %A2%CC

URL encoding (SHIFT_JIS): %81%CA

FULLWIDTH NOT SIGN

￢ U+FFE2 Unicode character (0g0.org)

URL encoding (UTF-8): %EF%BF%A2

References

文字集合の包含関係とテストに使うべき文字 - miauのブログ

Java シフトJISの扱い - Qiita

table:quote

SJIS/MS932	Decoded as SJIS	Decoded as MS932

0x815c	U+2014 : EM DASH "—"	U+2015 : HORIZONTAL BAR "―"

0x8160	U+301c : WAVE DASH "〜"	U+ff5e : FULLWIDTH TILDE "～"

0x8161	U+2016 : DOUBLE VERTICAL LINE "‖"	U+2225 : PARALLEL TO "∥"

0x817c	U+2212 : MINUS SIGN "−"	U+ff0d : FULLWIDTH HYPHEN-MINUS "－"

0x8191	U+00a2 : CENT SIGN "¢"	U+ffe0 : FULLWIDTH CENT SIGN "￠"

0x8192	U+00a3 : POUND SIGN "£"	U+ffe1 : FULLWIDTH POUND SIGN "￡"

0x81ca	U+00ac : NOT SIGN "¬"	U+ffe2 : FULLWIDTH NOT SIGN "￢"

SJIS/MS932	→ Decoded as SJIS	→ Encoded as MS932

0x815c	U+2014 : EM DASH "—"	cannot convert

0x8160	U+301c : WAVE DASH "〜"	cannot convert

0x8161	U+2016 : DOUBLE VERTICAL LINE "‖"	cannot convert

0x817c	U+2212 : MINUS SIGN "−"	cannot convert

0x8191	U+00a2 : CENT SIGN "¢"	0x8191

0x8192	U+00a3 : POUND SIGN "£"	0x8192

0x81ca	U+00ac : NOT SIGN "¬"	0x81ca

SJIS/MS932	→ Decoded as MS932	→ Encoded as SJIS

0x815c	U+2015 : HORIZONTAL BAR "―"	cannot convert

0x8160	U+ff5e : FULLWIDTH TILDE "～"	cannot convert

0x8161	U+2225 : PARALLEL TO "∥"	cannot convert

0x817c	U+ff0d : FULLWIDTH HYPHEN-MINUS "－"	cannot convert

0x8191	U+ffe0 : FULLWIDTH CENT SIGN "￠"	cannot convert

0x8192	U+ffe1 : FULLWIDTH POUND SIGN "￡"	cannot convert

0x81ca	U+ffe2 : FULLWIDTH NOT SIGN "￢"	cannot convert

Inserting these characters into SQL Server

code:sql

-- Create a temporary table

create table #t(vc varchar (4), nvc nvarchar(4));

-- Check the column metadata in tempdb

select

column_name as column_name

, data_type as type

, character_set_name as character_set

, collation_name as collation

, character_maximum_length as max_len

, character_octet_length as octet_len

from

tempdb.information_schema.columns;

truncate table #t;

insert into #t(vc) values ('—');

insert into #t(vc) values ('〜');

insert into #t(vc) values ('‖');

insert into #t(vc) values ('−');

insert into #t(vc) values ('¢');

insert into #t(vc) values ('£');

insert into #t(vc) values ('¬');

insert into #t(vc) values (N'—');

insert into #t(vc) values (N'〜');

insert into #t(vc) values (N'‖');

insert into #t(vc) values (N'−');

insert into #t(vc) values (N'¢');

insert into #t(vc) values (N'£');

insert into #t(vc) values (N'¬');

insert into #t(nvc) values ('—');

insert into #t(nvc) values ('〜');

insert into #t(nvc) values ('‖');

insert into #t(nvc) values ('−');

insert into #t(nvc) values ('¢');

insert into #t(nvc) values ('£');

insert into #t(nvc) values ('¬');

insert into #t(nvc) values (N'—');

insert into #t(nvc) values (N'〜');

insert into #t(nvc) values (N'‖');

insert into #t(nvc) values (N'−');

insert into #t(nvc) values (N'¢');

insert into #t(nvc) values (N'£');

insert into #t(nvc) values (N'¬');

-- Check what was inserted

select

vc

, len(vc) as length

, datalength(vc) as data_length

, cast(vc as varbinary(max)) as bin

from

#t

where

vc is not null;

select

nvc

, len(nvc) as length

, datalength(nvc) as data_length

, cast(nvc as varbinary(max)) as bin

from

#t

where

nvc is not null;

-- Drop the table so the script can be rerun in the same session

drop table tempdb.#t;

table:result

column_name type character_set collation max_len octet_len

vc varchar cp932 Japanese_CI_AS 4 4

nvc nvarchar UNICODE Japanese_CI_AS 4 8

table:result

vc length data_length bin

? 1 1 3F

￠ 1 2 8191

￡ 1 2 8192

￢ 1 2 81CA

? 1 1 3F

￠ 1 2 8191

￡ 1 2 8192

￢ 1 2 81CA

table:result

nvc length data_length bin

? 1 2 3F00

￠ 1 2 E0FF

￡ 1 2 E1FF

￢ 1 2 E2FF

— 1 2 1420

〜 1 2 1C30

‖ 1 2 1620

− 1 2 1222

¢ 1 2 A200

£ 1 2 A300

¬ 1 2 AC00

Inserting into nchar/nvarchar columns with the N prefix stores the intended characters (mind the endianness: bin is shown in UTF-16LE byte order)

This is as expected... or rather, it would be a problem if they did not go in
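On the endianness point: the bin values for nvc read in reverse byte order compared with the code point, which fits UTF-16 little-endian. A small sketch to turn those hex values back into characters (`BIN_VALUES` is copied from the nvc result, and `utf16le_hex_to_char` is a name made up for this example):

```python
# Hex strings copied from the bin column of the nvc result.
BIN_VALUES = ["1420", "1C30", "1620", "1222", "A200", "A300", "AC00"]


def utf16le_hex_to_char(hex_text):
    """Interpret a hex dump as UTF-16LE bytes and return the decoded text."""
    return bytes.fromhex(hex_text).decode("utf-16-le")


for hex_text in BIN_VALUES:
    ch = utf16le_hex_to_char(hex_text)
    print(f"{hex_text} -> U+{ord(ch):04X} {ch}")
```

For example, `1420` is the byte sequence 0x14 0x20, which is U+2014 once the bytes are swapped.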

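Everything else is inserted as follows

1. Becomes ?

EM DASH(—)

WAVE DASH(〜)

DOUBLE VERTICAL LINE(‖)

2. Converted to fullwidth

Case 1 makes sense, since I am trying to insert characters that exist in Shift_JIS but not in CP932

But which setting is implicitly doing case 2...?

Is there a SET option that performs some implicit conversion?

SET Statements (Transact-SQL) - SQL Server | Microsoft Learn

Apparently not

Is collation involved?

COLLATE (Transact-SQL) - SQL Server | Microsoft Learn

Maybe not? Specifying the code directly and trying to display it gives NULL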

code:sql

select

nchar (162) collate japanese_ci_as /* U+00A2 in Unicode (halfwidth ¢) */

, nchar (65504) collate japanese_ci_as /* U+FFE0 in Unicode (fullwidth ￠) */

, char (37249) collate japanese_ci_as /* Shift_JIS 0x8191 (halfwidth ¢) in cp932; note that CHAR() returns NULL for arguments outside 0-255 */

;

table:result

__COLUMN1 __COLUMN2 __COLUMN3

¢ ￠ « NULL »

So is something happening behind the scenes on insert/update after all?

INSERT (Transact-SQL) - SQL Server | Microsoft Learn

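When referencing the Unicode character data types nchar, nvarchar, and ntext, 'expression' should be prefixed with the capital letter 'N'. If 'N' isn't specified, SQL Server converts the string to the code page that corresponds to the default collation of the database or column. Any characters not found in this code page are lost.

This passage is about Unicode, but since it says characters missing from the code page are lost, it does explain why they become ?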

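Look into it from the code page angle?

But how would I even check that...?

Looking at the execution plan

I retrieved the plan for the halfwidth ¢ insert statement above using the approach from "SQL Serverの実行計画からクエリとパラメーターをSQLで取得する" (getting queries and parameters from SQL Server execution plans with SQL)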

code:xml

<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan"

Version="1.6" Build="14.0.1000.169">

<Batch>

<StmtSimple StatementText="(@1 varchar(8000))INSERT INTO t(vc) values(@1)"

StatementId="1" StatementCompId="3" StatementType="INSERT"

RetrievedFromCache="true"

StatementSubTreeCost="0.0100022" StatementEstRows="1"

SecurityPolicyApplied="false"

StatementOptmLevel="TRIVIAL" QueryHash="0x5789..."

QueryPlanHash="0x3F93..."

CardinalityEstimationModelVersion="140">

<StatementSetOptions QUOTED_IDENTIFIER="true"

ARITHABORT="true"

CONCAT_NULL_YIELDS_NULL="true"

ANSI_NULLS="true"

ANSI_PADDING="true"

ANSI_WARNINGS="true"

NUMERIC_ROUNDABORT="false" />

<OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="205213" EstimatedPagesCached="51303"

EstimatedAvailableDegreeOfParallelism="2" MaxCompileMemory="2047608" />

</TraceFlags>

<RelOp NodeId="0" PhysicalOp="Table Insert" LogicalOp="Insert" EstimateRows="1" EstimateIO="0.01"

EstimateCPU="1e-006" AvgRowSize="9" EstimatedTotalSubtreeCost="0.0100022" Parallel="0"

EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row">

</Identifier>

</ScalarOperator>

</Convert>

</ScalarOperator>

</DefinedValue>

</DefinedValues>

</Identifier>

</ScalarOperator>

</Assign>

</MultipleAssign>

</ScalarOperator>

</ScalarExpressionList>

</ScalarOperator>

</SetPredicate>

</ScalarInsert>

</RelOp>

</ParameterList>

</QueryPlan>

</StmtSimple>

</Statements>

</Batch>

</BatchSequence>

</ShowPlanXML>

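CONVERT_IMPLICIT means an implicit type conversion is taking place?

￠ is already fullwidth at the execution plan stage

So is it converted before the execution plan is built?

When 〜 is inserted, it is already ? at the execution plan stage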

code:xml

</ParameterList>

Oh, I may have badly misunderstood this

The character already becomes fullwidth at the point where Shift_JIS is converted to CP932

That happens before the insert
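If the substitution really does happen when the text is converted to CP932, the same mapping can be applied explicitly before encoding so that nothing silently turns into ?. A minimal sketch, assuming the seven pairs listed at the top of this page are the only ones that matter; `SJIS_TO_CP932` and `to_cp932_side` are hypothetical names, not part of any library.

```python
# Shift_JIS-side code point -> CP932-side code point,
# built from the seven pairs listed at the top of this page.
SJIS_TO_CP932 = str.maketrans({
    "\u2014": "\u2015",  # EM DASH -> HORIZONTAL BAR
    "\u301c": "\uff5e",  # WAVE DASH -> FULLWIDTH TILDE
    "\u2016": "\u2225",  # DOUBLE VERTICAL LINE -> PARALLEL TO
    "\u2212": "\uff0d",  # MINUS SIGN -> FULLWIDTH HYPHEN-MINUS
    "\u00a2": "\uffe0",  # CENT SIGN -> FULLWIDTH CENT SIGN
    "\u00a3": "\uffe1",  # POUND SIGN -> FULLWIDTH POUND SIGN
    "\u00ac": "\uffe2",  # NOT SIGN -> FULLWIDTH NOT SIGN
})


def to_cp932_side(text):
    """Replace the Shift_JIS-side characters with their CP932-side counterparts."""
    return text.translate(SJIS_TO_CP932)


sample = "\u301c\u2212"  # WAVE DASH followed by MINUS SIGN
converted = to_cp932_side(sample)
print([f"U+{ord(ch):04X}" for ch in converted])
print(converted.encode("cp932"))
```

Doing the replacement on the Unicode string keeps the choice of mapping visible in the code rather than leaving it to whichever conversion layer runs first.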

References

https://blog.mori-soft.com/entry/2021/10/14/214049

https://learn.microsoft.com/ja-jp/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16

https://learn.microsoft.com/ja-jp/search/?scope=sql&view=sql-server-ver16&terms=コード%20ページ%20932

https://learn.microsoft.com/ja-jp/search/?scope=sql&view=sql-server-ver16&terms=CONVERT_IMPLICIT

https://learn.microsoft.com/ja-jp/sql/integration-services/data-flow/transformations/character-map-transformation?view=sql-server-ver16

https://blog.engineer-memo.com/2012/12/16/暗黙の型変換を調べる方法の足掛かり/