TensorFlow `TextVectorization` cannot handle UTF-8 encoded text
When I try to adapt a text vectorization layer to a UTF-8 encoded vocabulary, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2122' in position 49: ordinal not in range(128)
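The input is encoded as UTF-8, which, as I understand it, should be compatible with TensorFlow.
Environment:
TensorFlow version: 2.10.0
Python version: 3.10.8
Platform: Linux
Full error traceback: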
InvalidArgumentError                      Traceback (most recent call last)
Cell In [10], line 8
      1 VOCAB_SIZE = 5000
      3 encoder = tf.keras.layers.TextVectorization(
      4     max_tokens=VOCAB_SIZE,
      5     standardize="lower",
      6 )
----> 8 encoder.adapt(train.map(
      9     lambda doc, label : doc
     10 ))

File ~/.local/lib/python3.10/site-packages/keras/layers/preprocessing/text_vectorization.py:467, in TextVectorization.adapt(self, data, batch_size, steps)
    417 def adapt(self, data, batch_size=None, steps=None):
    418     """Computes a vocabulary of string terms from tokens in a dataset.
    419
    420     Calling `adapt()` on a `TextVectorization` layer is an alternative to
   (...)
    465         argument is not supported with array inputs.
    466     """
--> 467     super().adapt(data, batch_size=batch_size, steps=steps)

File ~/.local/lib/python3.10/site-packages/keras/engine/base_preprocessing_layer.py:258, in PreprocessingLayer.adapt(self, data, batch_size, steps)
    256 with data_handler.catch_stop_iteration():
    257     for _ in data_handler.steps():
--> 258         self._adapt_function(iterator)
    259         if data_handler.should_sync:
    260             context.async_wait()

File ~/.local/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152     filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153     raise e.with_traceback(filtered_tb) from None
    154 finally:
    155     del filtered_tb

File ~/.local/lib/python3.10/site-packages/tensorflow/python/eager/execute.py:54, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     52 try:
     53     ctx.ensure_initialized()
---> 54     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     55                                         inputs, attrs, num_outputs)
     56 except core._NotOkStatusException as e:
     57     if name is not None:

InvalidArgumentError: Graph execution error:

2 root error(s) found.
  (0) INVALID_ARGUMENT: UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in __call__
    return [self._convert(x) for x in ret]
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in <listcomp>
    return [self._convert(x) for x in ret]
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 237, in _convert
    return result.astype(np.bytes_)
UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
	 [[UniqueWithCounts/_6]]
  (1) INVALID_ARGUMENT: UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in __call__
    return [self._convert(x) for x in ret]
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in <listcomp>
    return [self._convert(x) for x in ret]
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 237, in _convert
    return result.astype(np.bytes_)
UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_adapt_step_195]
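For what it's worth, the innermost frames fail in `script_ops.py` at `result.astype(np.bytes_)` under a `PyFunc` node, so the error seems to come from converting a Python `str` returned by a Python function in the dataset pipeline into bytes, not from `TextVectorization` itself. Below is a minimal, standard-library-only sketch of that failure mode. It does not reproduce the TensorFlow pipeline; the helper names `ascii_encode_fails` and `to_utf8_bytes` and the sample string are made up for illustration, and it assumes the Python function in the pipeline hands back `str` values rather than `bytes`.

```python
def ascii_encode_fails(text: str) -> bool:
    """Return True if converting `text` to bytes with the ASCII codec fails.

    This mirrors the kind of implicit str -> bytes conversion that raises
    the UnicodeEncodeError seen in the traceback.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


def to_utf8_bytes(text: str) -> bytes:
    """Hypothetical helper: encode explicitly as UTF-8 before handing data back."""
    return text.encode("utf-8")


# Sample containing the two characters named in the error messages:
# '\u2122' (trade mark sign) and '\xae' (registered sign).
sample = "Brand\u2122 and Mark\u00ae"

print(ascii_encode_fails(sample))   # the ASCII conversion is what breaks
print(to_utf8_bytes(sample))        # explicit UTF-8 bytes encode cleanly
```

If that assumption holds, returning UTF-8 encoded `bytes` from the Python function (instead of `str`) might avoid the implicit ASCII conversion, though I have not verified this against the pipeline above.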