Wittenberg

Las Intereçuns Speciais/Special Interests => La Società Rexhital per l'Avançamaintsch del Säp/The Royal Society for the Advancement of Knowledge => Topic started by: Sir Lüc on December 13, 2022, 11:17:58 AM

Title: Streaming algorithms for center-based clustering in general metrics
Post by: Sir Lüc on December 13, 2022, 11:17:58 AM
Azul,

although not a Royal Society member, yesterday marked the official end of my academic career**, and I'd like to share my master's degree thesis for consumption by our scientific community. (The presentation was a bit less low-level and might have been more understandable by laypeople, but I don't have a full recording of it.)

Since my graduation coincided with Llimbaziuă, I'm extremely tempted to release at least some abridged version of it in Talossan. In the meantime though, here's the full stuff as submitted in English:

Streaming algorithms for center-based clustering in general metrics (https://thesis.unipd.it/bitstream/20.500.12608/40246/1/Badin_Luca.pdf)

Abstract: Center-based clustering is an important yet computationally difficult primitive in the realm of unsupervised learning and data analysis. We specifically focus on the k-median clustering problem in general metric spaces, in which one seeks to find a set of k centers, so to minimise the sum of distances from each point in the dataset to its closest center. In this thesis, we present and analyze efficient techniques to deal with the k-median clustering problem in the streaming setting – where the dataset is presented one point at a time and not accessible in its entirety* – by leveraging the dimensionality of the dataset's underlying metric space.

(*: not accessible in its entirety all at once, I should have written. Yay for writing it at some ungodly hour of the night :p)

(**: sike! in February 2023, I accepted a position as (non-postdoctoral) research fellow, so my academic career is not actually over yet.)
Title: Re: Streaming algorithms for center-based clustering in general metrics
Post by: Sir Lüc on December 13, 2022, 11:20:55 AM
Forgot to mention, there's an experimental part for which I developed some C++ code:
- the PLS+ algorithm for streaming k-median coreset construction
- the StreamingCWB algorithm for streaming coreset refinement
- the kmedian++ (approximation) algorithm for solving the k-median clustering problem (not part of the proposed techniques; used to get an estimation of the optimum clustering cost for comparison purposes)

Since code snippets have previously been shared to the Royal Society board, I might share that as well, once I clean up the code a bit and move the repo somewhere safe (idk how long my department will preserve my GitLab account)
Title: Re: Streaming algorithms for center-based clustering in general metrics
Post by: Baron Alexandreu Davinescu on December 13, 2022, 11:38:08 AM
CONGRATULATIONS!

And thank you for the excellent paper!  I look forward to reading it :)
Title: Re: Streaming algorithms for center-based clustering in general metrics
Post by: Sir Txec dal Nordselvă, UrB on December 13, 2022, 12:10:13 PM
While the vast majority of this might as well be written in ancient Latin, congrats on achieving your goal!
Title: Re: Streaming algorithms for center-based clustering in general metrics
Post by: Sir Lüc on December 13, 2022, 02:26:02 PM
Thank you ^^
Title: Re: Streaming algorithms for center-based clustering in general metrics
Post by: Breneir Tzaracomprada on December 14, 2022, 10:15:24 PM
Congratulations Luc!