From owner-chemistry@ccl.net Tue Jul 22 12:51:01 2025
From: "Brian Skinn brian.skinn^^gmail.com"
To: CCL
Subject: CCL: Preservation of the CCL archive
Message-Id: <-55406-250722011613-7763-FK7Fr8glX/P0fgdkOYwDHA!=!server.ccl.net>
X-Original-From: Brian Skinn
Date: Tue, 22 Jul 2025 00:51:21 -0400

Sent to CCL by: Brian Skinn [brian.skinn() gmail.com]

All,

Alain and I have discussed the situation and we believe that the best
path forward is to release the database with sender email addresses
obfuscated in the same manner as they have been on CCL, except that
each unique email address will be obfuscated the same way each time.

This will allow all messages from a single sender email address to be
associated together, while preventing the addresses from being
exploited by a bulk email harvester. While it would be nice to foil
more motivated de-obfuscation, that's quite a challenging problem and
more than either of us has margin for. CCL contributors past and
present have already at least implicitly consented to the current
obfuscation scheme, so re-using it will leverage that existing consent.

Once Jan shuts off CCL to new messages, I'll complete ingestion of the
tail of the list into the database and then work on publishing to
Zenodo. Since it won't be possible to broadcast a link to the Zenodo
record through CCL, what with it being shut down at that point, I've
created a GitHub Gist
(https://gist.github.com/bskinn/18d9db72945794b5713108a6127fdf4f) to
serve as a 'bulletin board' of sorts, to announce to interested
individuals where they can find the database once it's published.

Anyone is also welcome to email me directly at any time to inquire
about the status of the database publication. Or, if anyone has
questions/comments that would be suitable for public display, please
feel free to post them in the comment thread on the Gist page.

Best regards,
Brian

On Fri, Jul 11, 2025 at 6:49 PM Alain Borel alain.borel..epfl.ch
<owner-chemistry,,ccl.net> wrote:

> On 09.07.2025 20:15, Brian Skinn brian.skinn[-]gmail.com wrote:
>
> > However, my scraping code includes logic to de-obfuscate email
> > addresses, for the purpose of accurately associating different
> > messages sent from the same email account. Given that the database
> > contains hundreds (at least) of such de-obfuscated email addresses,
> > I suspect there are privacy considerations to any publication
> > effort. The obfuscated email addresses are arguably already public
> > information, since they are publicly posted and easy for a human to
> > decode. But, the point of the obfuscation is making them difficult
> > to exploit at-scale. A database with *de-obfuscated* email
> > addresses would be much easier to exploit in an automated way, and
> > thus (I expect) has more significant privacy implications.
> >
> > I have neither the expertise to correctly handle release of this,
> > if special handling is required, nor the spare time or resources to
> > figure out the correct approach. But, I am interested in working
> > with others to make it happen.
> >
> > Anyone interested, please let me know.
>
> Thanks for the initiative!
>
> Former computational chemist turned librarian & research data
> management specialist here - I'd be happy to help.
>
> I would need to take a closer look at the obfuscated addresses to
> figure out exactly what could/should be done, but it doesn't feel
> impossible at all.
>
> ... and thanks to the list admins who managed it during all these
> years, of course :-)
>
>
> Best regards,
> Alain Borel
> Research data management specialist
> Bibliothèque de l’EPFL
> Rolex Learning Center
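
The consistent-obfuscation idea described in the message above (each unique address obfuscated the same way each time) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the scheme used by CCL or by the database code: the marker list is modeled on the "@" stand-ins visible in this message, and the function name and the hash-based selection rule are hypothetical.

```python
import hashlib

# Illustrative stand-ins for "@", modeled on markers visible in this message.
# The actual CCL marker set and selection rule are not given (assumption).
MARKERS = ["^^", "()", "..", ",,", "[-]", "!=!"]


def obfuscate_consistently(address: str) -> str:
    """Replace '@' with a marker chosen deterministically from the address.

    Hypothetical helper: the same input address always yields the same
    obfuscated string, so messages from one sender can still be grouped.
    """
    local, at, domain = address.strip().rpartition("@")
    if not at or not local or not domain:
        raise ValueError(f"not a simple email address: {address!r}")
    # Hash the lower-cased address so the marker choice is stable across runs.
    digest = hashlib.sha256(f"{local}@{domain}".lower().encode("utf-8")).digest()
    marker = MARKERS[digest[0] % len(MARKERS)]
    return f"{local}{marker}{domain}"
```

Calling `obfuscate_consistently("jane.doe@example.org")` twice returns the same string both times, which is what allows grouping by sender. As the message notes, this kind of obfuscation only deters bulk harvesting; a motivated reader can still decode it.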