Tensorflow `TextVectorization` cannot handle UTF-8 encoding

The following error occurs when trying to adapt a `TextVectorization` layer to a UTF-8-encoded vocabulary:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2122' in position 49: ordinal not in range(128)


The input is encoded as UTF-8, which, as I understand it, should be compatible with Tensorflow.
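
For comparison, a plain tf.string tensor built from a Python string containing the same kind of characters works as expected (minimal illustrative snippet, not from the failing notebook; the sample text is made up):

import tensorflow as tf

# tf.string tensors hold raw bytes; a Python str constant is encoded as UTF-8,
# so non-ASCII characters such as '™' (\u2122) and '®' (\xae) are accepted here.
sample = tf.constant("Acme™ blender® review")
print(sample)
# tf.Tensor(b'Acme\xe2\x84\xa2 blender\xc2\xae review', shape=(), dtype=string)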

Environment:

Tensorflow version: 2.10.0
Python version: 3.10.8
Platform: Linux
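
For reference, here is the cell that triggers the error, reassembled from the traceback below (train is my already-built tf.data.Dataset of (doc, label) pairs):

import tensorflow as tf

VOCAB_SIZE = 5000

encoder = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize="lower",
)

# Keep only the document strings when building the vocabulary.
encoder.adapt(train.map(
    lambda doc, label: doc
))
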
Full error traceback:

InvalidArgumentError                      Traceback (most recent call last)
Cell In [10], line 8
      1 VOCAB_SIZE = 5000
      3 encoder = tf.keras.layers.TextVectorization(
      4     max_tokens=VOCAB_SIZE,
      5     standardize="lower",
      6 )
----> 8 encoder.adapt(train.map(
      9     lambda doc, label : doc
     10 ))
File ~/.local/lib/python3.10/site-packages/keras/layers/preprocessing/text_vectorization.py:467, in TextVectorization.adapt(self, data, batch_size, steps)
    417 def adapt(self, data, batch_size=None, steps=None):
    418     """Computes a vocabulary of string terms from tokens in a dataset.
    419 
    420     Calling `adapt()` on a `TextVectorization` layer is an alternative to
   (...)
    465           argument is not supported with array inputs.
    466     """
--> 467     super().adapt(data, batch_size=batch_size, steps=steps)
File ~/.local/lib/python3.10/site-packages/keras/engine/base_preprocessing_layer.py:258, in PreprocessingLayer.adapt(self, data, batch_size, steps)
    256 with data_handler.catch_stop_iteration():
    257     for _ in data_handler.steps():
--> 258         self._adapt_function(iterator)
    259         if data_handler.should_sync:
    260             context.async_wait()
File ~/.local/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb
File ~/.local/lib/python3.10/site-packages/tensorflow/python/eager/execute.py:54, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     52 try:
     53   ctx.ensure_initialized()
---> 54   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     55                                       inputs, attrs, num_outputs)
     56 except core._NotOkStatusException as e:
     57   if name is not None:
InvalidArgumentError: Graph execution error:
2 root error(s) found.
  (0) INVALID_ARGUMENT:  UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in __call__
    return [self._convert(x) for x in ret]
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in 
    return [self._convert(x) for x in ret]
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 237, in _convert
    return result.astype(np.bytes_)
UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
     [[{{node PyFunc}}]]
     [[IteratorGetNext]]
     [[UniqueWithCounts/_6]]
  (1) INVALID_ARGUMENT:  UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in __call__
    return [self._convert(x) for x in ret]
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in 
    return [self._convert(x) for x in ret]
  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 237, in _convert
    return result.astype(np.bytes_)
UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
     [[{{node PyFunc}}]]
     [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_adapt_step_195]
