Decoding error receiving messages

I’ve written a chat bot and it was working fine until recently. I’m not sure what has been added to the messages, but I have been getting decode errors when receiving them. The bot is written in Python and the line that raises the error is this one:
data = socket.recv(2048).decode('UTF-8')
The errors I’m getting are:
‘utf-8’ codec can’t decode byte _ in position _ : unexpected end of data/ invalid start of data
I didn’t start getting these until recently. This bot worked perfectly fine in December, when I was playing around with Twitch IRC bots.
Did something change in the messages so that not everything is UTF-8 anymore?
Any help would be appreciated, thank you.

We have not changed any encoding within the last 2 years afaik, and UTF-8 should encompass everything (I think). Can you spit out raw lines in like UTF-32 to see what’s up?

I wonder if the issue is caused by getting incomplete outputs from the IRC server. I’m not quite sure.
Here’s the output of the raw lines decoded as UTF-32 when joining a channel:

error: ‘utf-32-le’ codec can’t decode bytes in position 0-3: code point not in range(0x110000)
error: ‘utf-32-le’ codec can’t decode bytes in position 0-3: code point not in range(0x110000)
error: ‘utf-32-le’ codec can’t decode bytes in position 0-3: code point not in range(0x110000)
error: ‘utf-32-le’ codec can’t decode bytes in position 0-3: code point not in range(0x110000)
error: ‘utf-32-le’ codec can’t decode bytes in position 0-3: code point not in range(0x110000)

Here’s UTF-8 of the same channel join:

:tmi.twitch.tv 001 sato_chat :Welcome, GLHF!
:tmi.twitch.tv 002 sato_chat :Your host is tmi.twitch.tv
:tmi.twitch.tv 003 sato_chat :This server is rather new
:tmi.twitch.tv 004 sato_chat :-
:tmi.twitch.tv 375 sato_chat :-
:tmi.twitch.tv 372 sato_chat :You are in a maze of twisty passages, all alike.
:tmi.twitch.tv 376 sato_chat :>
:tmi.twitch.tv CAP * ACK :twitch.tv/membership
:tmi.twitch.tv CAP * ACK :twitch.tv/tags
:tmi.twitch.tv CAP * ACK :twitch.tv/commands
:sato_chat!sato_chat@sato_chat.tmi.twitch.tv JOIN #dansgaming
@badges=;color=;display-name=sato_chat;emote-sets=0;mod=0;subscriber=0;user-type= :tmi.twitch.tv USERSTATE #dansgaming
@broadcaster-lang=;emote-only=0;followers-only=20;r9k=0;room-id=7236692;slow=5;subs-only=0 :tmi.twitch.tv ROOMSTATE #dansgaming
:sato_chat.tmi.twitch.tv 353 sato_chat = #dansgaming :ghentbot gibbed 9steven ascothero miturner moobot fur3x dansgaming analyticsbot
:sato_chat.tmi.twitch.tv 353 sato_chat = #dansgaming :sato_chat
:sato_chat.tmi.twitch.tv 366 sato_ch

Can you provide the raw binary bytes off the socket?

I’m going to add that I suspect you’ve received exactly 2048 bytes, and that the last X bytes are a partial code point.
As an example, © under UTF-8 is the two bytes C2 A9. What if the final byte in what you just received is C2?
b'\xc2'.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: unexpected end of data

You need to use something like the codecs module’s incremental decoder, which keeps any trailing partial byte sequence from one socket read and joins it to the next, so a code point split between recv calls is decoded correctly.
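A minimal sketch of that approach, in case it helps. The helper name decode_chunks and the sample chunks are made up for illustration, and the byte strings stand in for what two consecutive recv calls might return; codecs.getincrementaldecoder is the real standard-library call. Here "é" (bytes C3 A9) is split across the two chunks:

```python
import codecs


def decode_chunks(chunks):
    """Decode an iterable of byte chunks as UTF-8.

    A multi-byte character split across two chunks is held back
    by the decoder until its remaining bytes arrive.
    (Hypothetical helper, not part of any library.)
    """
    # errors="replace" is an assumption: it turns genuinely invalid
    # bytes into U+FFFD instead of raising and killing the bot.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True flushes the decoder; a dangling partial sequence
    # at the very end of the stream becomes U+FFFD.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Simulated recv() results: the first ends in the middle of "é".
chunks = [b"PRIVMSG #chan :caf\xc3", b"\xa9 time\r\n"]
result = "".join(decode_chunks(chunks))
print(repr(result))
```

In a real bot you would create one decoder per connection and call decoder.decode(sock.recv(2048)) inside the read loop. Note that a recv boundary can also fall in the middle of an IRC line, so it is probably worth buffering the decoded text and only processing complete lines ending in "\r\n".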

The incremental decoder fixed everything when I ran my tests yesterday. Thank you so much.
