I've just done some research (by stepping through with a debugger), and socket.getaddrinfo _does_ perform the encoding of non-ASCII characters:
In [7]: socket.getaddrinfo('www.\u2603.com', None)[0][4][0]
Out[7]: '185.53.178.7'
It does so using the 'idna' encoding:
In [2]: "www.☃.com".encode('idna')
Out[2]: b'www.xn--n3h.com'
which (unsurprisingly, given this bug) doesn't do anything to underscores:
In [4]: "www_foo.☃.com".encode('idna')
Out[4]: b'www_foo.xn--n3h.com'
So I believe the correct implementation of (a) would be to encode the URL ourselves and then strip out any invalid characters. (We should check whether any stdlib/requests functionality already does this.)
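As a rough sketch of what "encode ourselves, then drop invalid characters" could look like (the helper name `encode_and_strip` and the allowed-character set are my own assumptions, not existing stdlib/requests functionality, and it works on the hostname part only):

```python
import re

# Assumption: after IDNA encoding, only ASCII letters, digits, hyphens and
# the dot separator are treated as valid; everything else is dropped.
_INVALID_HOSTNAME_BYTES = re.compile(rb"[^A-Za-z0-9.-]")


def encode_and_strip(hostname):
    """Hypothetical helper: IDNA-encode a hostname, then drop invalid bytes."""
    # Same codec that socket.getaddrinfo was observed to use above.
    encoded = hostname.encode("idna")
    # The codec leaves underscores untouched, so remove them (and any other
    # byte outside the allowed set) explicitly.
    return _INVALID_HOSTNAME_BYTES.sub(b"", encoded)


print(encode_and_strip("www_foo.\u2603.com"))
```

Silently dropping characters changes which host gets resolved, so rejecting such names outright might turn out to be the safer behaviour; this only illustrates the approach described above.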