Streaming algorithms for center-based clustering in general metrics

Started by Lüc, December 13, 2022, 11:17:58 AM

Previous topic - Next topic

Lüc

Azul,

although not a Royal Society member, yesterday marked the official end of my academic career, and I'd like to share my master's degree thesis for consumption by our scientific community. (The presentation was a bit less low-level and might have been more understandable by laypeople, but I don't have a full recording of it.)

Since my graduation coincided with Llimbaziuă, I'm extremely tempted to release at least some abridged version of it in Talossan. In the meantime though, here's the full stuff as submitted in English:

Streaming algorithms for center-based clustering in general metrics

Abstract: Center-based clustering is an important yet computationally difficult primitive in the realm of unsupervised learning and data analysis. We specifically focus on the k-median clustering problem in general metric spaces, in which one seeks to find a set of k centers, so to minimise the sum of distances from each point in the dataset to its closest center. In this thesis, we present and analyze efficient techniques to deal with the k-median clustering problem in the streaming setting – where the dataset is presented one point at a time and not accessible in its entirety* – by leveraging the dimensionality of the dataset's underlying metric space.

(*: not accessible in its entirety all at once, I should have written. Yay for writing it at some ungodly hour of the night :p)
Sir Lüc da Schir, UrB
Permanent Secretary of Backend Admin / Secretar Parmanint per l'Aðmistraziun del Backend

zïa, v'amic, ¡cóchenets non sint spiçăs!

Lüc

Forgot to mention, there's an experimental part for which I developed some C++ code:
- the PLS+ algorithm for streaming k-median coreset construction
- the StreamingCWB algorithm for streaming coreset refinement
- the kmedian++ (approximation) algorithm for solving the k-median clustering problem (not part of the proposed techniques; used to get an estimation of the optimum clustering cost for comparison purposes)

Since code snippets have previously been shared to the Royal Society board, I might share that as well, once I clean up the code a bit and move the repo somewhere safe (idk how long my department will preserve my GitLab account)
Sir Lüc da Schir, UrB
Permanent Secretary of Backend Admin / Secretar Parmanint per l'Aðmistraziun del Backend

zïa, v'amic, ¡cóchenets non sint spiçăs!

Baron Alexandreu Davinescu

CONGRATULATIONS!

And thank you for the excellent paper!  I look forward to reading it :)
Alexandreu Davinescu, Baron Davinescu del Vilatx Freiric del Vilatx Freiric es Guaír del Sabor Talossan

    Bitter struggles deform their participants in subtle, complicated ways. ― Zadie Smith
    Revolution is an art that I pursue rather than a goal I expect to achieve. ― Robert Heinlein


Sir Txec dal Nordselvă, UrB

While the vast majority of this might as well be written in ancient Latin, congrats on achieving your goal!
Sir Txec Róibeard dal Nordselvă, UrB, GST, O.SPM, SMM
Secretár d'Estat
Guaír del Sabor Talossan
The Squirrel Viceroy of Arms, The Rouge Elephant Herald, RTCoA
Justice Emeritus of the Uppermost Cort
Former Seneschal

Lüc

Sir Lüc da Schir, UrB
Permanent Secretary of Backend Admin / Secretar Parmanint per l'Aðmistraziun del Backend

zïa, v'amic, ¡cóchenets non sint spiçăs!

Breneir Tzaracomprada