>Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data.
This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.
SELECT
  id,
  text,
  `by` AS username,
  FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
FROM
  `bigquery-public-data.hacker_news.full`
WHERE
  type = 'comment'
  AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
ORDER BY
  time DESC
LIMIT
  100
With a more straightforward approach, the tool can be reproduced with just a few queries in ClickHouse.
1. Create a table with styles by authors:
CREATE TABLE hn_styles (name String, vec Array(UInt32)) ENGINE = MergeTree ORDER BY name
2. Calculate and insert style vectors (the insert takes 27 seconds):
INSERT INTO hn_styles WITH 128 AS vec_size,
cityHash64(arrayJoin(tokens(lower(decodeHTMLComponent(extractTextFromHTML(text)))))) % vec_size AS n,
arrayMap((x, i) -> i = n, range(vec_size), range(vec_size)) AS arr
SELECT by, sumForEach(arr) FROM hackernews_history GROUP BY by
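For readers without ClickHouse at hand, step 2 is essentially feature hashing: every token is hashed into one of 128 buckets and the per-author bucket counts form the style vector. Below is a minimal Python sketch of that idea. The tokenizer regex, the use of MD5 instead of cityHash64, and the sample comments are assumptions for illustration, and the HTML decoding done by the query is left out, so the bucket numbers will not match the ClickHouse output.

```python
import hashlib
import re
from collections import defaultdict

VEC_SIZE = 128  # same number of buckets as vec_size in the query

def bucket(token, size=VEC_SIZE):
    # Stable stand-in hash (MD5); the query uses cityHash64, so ids differ.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size

def tokenize(text):
    # Rough stand-in for tokens(lower(...)): runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def style_vectors(comments):
    # comments: iterable of (author, text) pairs -> {author: bucket counts}
    vecs = defaultdict(lambda: [0] * VEC_SIZE)
    for author, text in comments:
        for tok in tokenize(text):
            vecs[author][bucket(tok)] += 1
    return dict(vecs)

# Hypothetical sample data, only to show the shape of the result.
sample = [
    ("alice", "I think that this is fine."),
    ("alice", "This is it."),
    ("bob", "Nope."),
]
vectors = style_vectors(sample)
```

Each vector has 128 components and its sum equals the number of tokens the author wrote, which is what the `sumForEach` over one-hot arrays produces in the query.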
It works for me. The accounts I used a long time ago are there in high positions. I guess that my style is very distinctive.
But I also have seen some accounts that seem to be from other non-native English speakers. They may even have a Latin language as their native one (I just read some of their comments, and, at minimum, some of them seem to also be from the EU). So I guess that it is also grouping people by their native language, when that language is not English.
So, maybe, it is grouping many accounts by the shared bias of their native languages. Probably, we make the same type of mistakes while using English.
My guess is that accounts of native Indian or Chinese speakers will also be grouped together, for the same reason. Even more so, as those languages are more different from English and the bias is probably stronger.
It would be cool if Australians, Brits, and Canadians tried the tool. My guess is that the probability of them finding alt-accounts is higher, as the populations are smaller and the writing is more distinctive than that of Americans.
Thanks for sharing the projects. It is really interesting.
Also, do not trust the comments too much. There is an incentive to lie, so as not to acknowledge alt-accounts if they were created to remain hidden.
I noticed that it also depends on the vendor of the autocorrect/dictionary you're using.
The project referenced in the post put me next to Brits on the similarity list and indeed I am using an English(UK) dictionary. Meanwhile this iteration aligns me with Americans despite the only change being the vendor (formerly Samsung, now Google).
I guess the Samsung keyboard corrects to proper Bri'ish.
I picked up the language as a child from a collection of people, half of whom weren't native speakers, so I don't speak any specific dialect.
The matching score is probably the same, or very close in both ways, but this fact does not necessarily help in a three-way scenario:
A <-> B: 80%
A <-> C: 90%
B <-> C: 70%
When you search for A the best match will be C, but if you start with B it will be A. If one of the accounts has a smaller sample set as in GP's case, the gap could be quite big.
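The three-way example can be checked mechanically. This small Python sketch uses exactly the scores listed above (the function name is made up for illustration) and shows that symmetric pair scores still give non-mutual best matches.

```python
# Symmetric pair scores from the example above.
scores = {
    frozenset(("A", "B")): 0.80,
    frozenset(("A", "C")): 0.90,
    frozenset(("B", "C")): 0.70,
}

def best_match(user, scores):
    # Collect the score of every pair that contains `user`.
    candidates = {}
    for pair, score in scores.items():
        if user in pair:
            (other,) = pair - {user}
            candidates[other] = score
    return max(candidates, key=candidates.get)

best = {u: best_match(u, scores) for u in "ABC"}
# A's best match is C, but B's best match is A, so "best match" is not mutual.
```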
My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently than I would otherwise.
They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to")
My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.
I would prefer the "analyze" feature focus on content rather than structure words. I forget the specific linguistic terms but to a first approximation, nouns and verbs would be of interest, prepositions and articles not. Let's call the former "syntactic" and the latter "semantic."
I suppose it's possible the "analyze"-reported proportions are a lot more precise and reliably diagnostic than I imagine. I haven't yet looked in detail at the statistical method.
Also, of course, it would require integration with NLP tooling such as WordNet (or whatever's SOTA there something like a decade and a half on) and a bit of Porter stemming to do part-of-speech tagging. If one 0.7GB dataset is heavyweight where this is running, that could be a nonstarter; stemming is trivial and I recall WordNet being acceptably fast if maybe memory hungry on a decade ago's kinda crappy laptop, but I could see it requiring some expensive materialization just to get datasets to inspect. (How exactly do we define "more common" for eg "smooth?" Versus semantic words, all words, both, or some combination? Do we need another dataset filtered to semantic words? Etc.)
If we're dreaming and I can also have a pony, then it would be neat to see both the current flavor, one focused on semantics as above, and one focused specifically on syntax as this one coincidentally often seems to act like. I would be tempted to offer an implementation, but I'm allergic to Python this decade.
Of course, immediately after the edit window closes, I revisit this comment and discover that in the first paragraph I swapped my terms and made a hash of the rest of the thing. Please cut out and paste into your printouts the following corrected version. Thank you!
> I would prefer the "analyze" feature focus on content rather than structure words. I forget the specific linguistic terms but to a first approximation, nouns and verbs would be of interest, prepositions and articles not. Let's call the former "semantic" and the latter "syntactic."
Should is a commonly used word and a fine one. You should feel free to use it. If someone gets hot under the collar because you said he should do something then he is an idiot.
"Ought to" is essentially a synonym. Anyone that gets upset when you said they should do something but is fine when you say that they ought to do something is truly a moron.
This isn't a habit of communication. I honestly mean it: if you get upset that someone said you "should" do something, but you are fine when they say you "ought to" do it, then you must be stupid. They mean the same thing in modern English.
Yes but words hold memories to others. Since 'ought to' is less frequently used it doesn't 'trigger' people the same way.
Most people are emotion-first, how the words make them feel is more important than the definitions of them. Being emotion-first doesn't make them stupid.
The only time to avoid command words like should is when the person could conceivably see them as a command. Because then you're being a dick.
Otherwise, if someone wants to take the time to dissect meaning from add-on meaningless words like should in a sentence, they should find something better to do with their time. Or just ask instead of being a moron.
How are you being a dick?! There are loads of reasons why you may want or need to instruct someone to do something. I prefer the imperative mood. It is more direct. "Sudo make me a cup of tea".
I (also?) felt the 'words used less often' were much easier to connect to as a conscious effort. I pointed chatgpt to the article and pasted in my results and asked it what it could surmise about my writing style based on that. It probably connected about as well as the average horoscope but was still pretty interesting!
I wouldn't take it that far, individual rights are important but so is functioning society. Counterculture is often a cyclic/youthful reaction to things that have been established as "better" or "proper" by older generations. Most rules for life such as the 10 Commandments have sort of a "deny yourself some pleasure because it's better for the group" vibe that has been learned and re-learned over many centuries.
Interestingly, when most people simply choose to do what most people choose to do, you get an emergent 'herd mentality' which can lead to some very strange places. It is also sensitive to very small perturbations - which in real terms means, the one person who does manage to think for themselves may find they have an outsized effect on the direction of the crowd.
I think this mentality is also where the term 'sheeple' comes from.
Yes, "ought" is the past tense of "owe". At some point, the second alternative spelling "owed" was introduced to better separate the two meanings (literal and figurative), but it's still the same word; a similar thing happened with "flower" and "flour", those used to be interchangeable spellings of the same word but then somebody decided that the two meanings of that word should be separated and given specific spellings.
And the construct "you owe it to <person> to <verb>" still exists even today but is not nearly as popular as "you should <verb>" precisely because it has to state to whom exactly you owe the duty; with "should" it sounds like an impersonal, quasi-objective statement of fact which suits the manipulative uses much better.
I have occasionally used the construct “you owe it to yourself to X”. I think it works well at conveying the sentiment that the person in question may be missing out on something if they don’t do X.
“You should” has a much more generic and less persuasive sentiment. “Why should I?” is a common and easy response which now leaves the suggester having to defend their suggestion to a skeptical audience.
Good point about "should" - it's also a word that has lost its original meaning. Shall, should, will and would used to have different, more nuanced meanings compared to how we tend to use them today.
The only place today I see "shall" used correctly where most would say "should" or "will," is in legal documents and signage.
I wonder how much accuracy would be improved if expanding from single words to the most common pairs or n-tuples.
You would need more computation to hash, but I bet adding frequency of the top 50 word-pairs and top 20 most common 3-tuples would be a strong signal.
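A rough Python sketch of what that could look like: count word pairs and 3-tuples per comment, keep the most common ones as a vocabulary, and turn each user's text into relative frequencies over that vocabulary. The tokenizer and function names are assumptions for illustration; the 50 and 20 cut-offs would just be the `k` argument.

```python
import re
from collections import Counter

def words(text):
    # Hypothetical tokenizer: lowercase letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def ngrams(seq, n):
    # Consecutive n-tuples of a token list.
    return list(zip(*(seq[i:] for i in range(n))))

def top_ngrams(comments, n, k):
    # Count per comment, so an n-gram never spans two comments.
    counts = Counter()
    for text in comments:
        counts.update(ngrams(words(text), n))
    return counts.most_common(k)

def ngram_features(text, vocabulary, n):
    # Relative frequency of each vocabulary n-gram in one user's text.
    grams = ngrams(words(text), n)
    counts = Counter(grams)
    total = max(len(grams), 1)
    return [counts[g] / total for g in vocabulary]
```

For example, `top_ngrams(all_comments, 2, 50)` would give the pair vocabulary and `top_ngrams(all_comments, 3, 20)` the 3-tuple one, where `all_comments` is whatever list of comment texts you have loaded.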
( Noting that the accuracy is already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote. )
Maybe there isn't enough data for each user for pairs, but I thought about mixing the two approaches (but had no time to do it), that is, to have 350 components like now, for the single word frequency, plus another 350 for the most common pairs frequency. In this way part of the vector would remain a high enough signal even for users with comparably less data.
I've been thinking some more about this, and it occurred to me that you'd want to encode sentence boundaries as a pseudo-word in the n-tuples.
I then realised that "[period] <word>" would likely dominate most common pairs, and that a lot of time could be saved by simply recording the first word of sentences as their own vector set, in addition but separate to the regular word vector.
Whether this would be a stronger or weaker signal per-vector-space than the tail of words in the regular common-words vector I don't know.
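A minimal Python sketch of the "first word of each sentence" vector, kept separate from the regular word vector. The sentence splitting here is a naive assumption (split on `.`, `!` or `?` followed by whitespace), and the function names are made up.

```python
import re
from collections import Counter

def sentence_openers(text):
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = re.split(r"[.!?]+\s+", text.strip())
    openers = []
    for sentence in sentences:
        match = re.search(r"[A-Za-z']+", sentence)  # first word, if any
        if match:
            openers.append(match.group(0).lower())
    return openers

def opener_vector(comments, vocabulary):
    # Relative frequency of each vocabulary word as a sentence opener.
    counts = Counter()
    for text in comments:
        counts.update(sentence_openers(text))
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocabulary]
```

The vocabulary would be the most common openers across all users, chosen the same way as the common-words list.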
When I ran it, it gave me 20 random users, but when I do the analyze, it says my most common words are [they because then that but their the was them had], which is basically just the most common English words.
Probably would be good to exclude those most common words.
I had a similar result. 85%+ similar to a bunch of random accounts, and my perhaps most distinguishing feature is I don't use the word 'app' or 'company' a lot. The former because I dislike the word, and the latter maybe because I'm self-employed.
I figured it maybe would cluster me with other non-native speakers but it doesn't appear to. Of all the accounts where I could identify a country of origin, all were American.
This is a great example of what's possible and how true anonymity, even online, is only "technological threshold" anonymity. People obsessed with biometrics might not consider that this is another biometric.
Instead of just HN, now do it with the whole internet, imagine what you'd find. Then imagine that it's not being done already.
This technique yields so many false positives and negatives, it's practically useless. Possibly it works reliably for mono-lingual, prolific writers. Someone like the Qanon shaman (or whatever the name was) might be picked up, if it doesn't happen to be a collective.
Except that technology is on the side of anonymity this time. LLMs can provide a pretty solid defense against such attacks — just ask ChatGPT to rewrite your message in a random writer's style. The issue is that you'll end up sounding like an LLM, but hey, tradeoffs.
Using throwaways whenever possible mitigates a lot of the risk, too.
That’s true. The old security versus convenience hack.
But if i were a government agency I would be pressing AI providers for data, or fingerprinting the output with punctuation/whitespace or something more subtle.
Tho i guess with open models that people can run on device that’s mitigated a lot.
It did find my "alt" (really an old account with a lost password), but the rest of the list – all users with very high match scores (0.8+) – is random.
Taking a look at comments from those users, I think the issue is that the algorithm focuses too much on the topic of discussion rather than style. If you are often in conversations about LLMs or Musk or self driving cars then you will inevitably end up using a lot of similar words as others in the same discussions. There's only so many unique words you can use when talking about a technical topic.
I see in your post that you try to mitigate this by reducing the number of words compared, but I don't think that is enough to do the job.
I wonder how much curly quote usage influences things. I type things like curly quotes with my Compose key, and so do most of my top similars; and four or five words with straight quotes show up among the bottom ten in our analyses. (Also etc, because I like to write &c.)
I’m not going to try comparing it with normalising apostrophes, but I’d be interested how much of a difference it made. It could easily be just that the sorts of people who choose to write in curly quotes are more likely to choose words carefully and thus end up more similar.
Curly vs. straight quotes is mainly a mobile vs. desktop thing AFAIK. Not sure what Mac does by default, but Windows and Linux users almost exclusively use plain straight quotes everywhere.
My impression is that iOS is the only major platform to even support automatically curlifying quotation marks. Maybe some Android keyboards are more sensible about it, but none that I’ve used make it anything but manual.
We can improve this. antirez has made a highly compelling poc but it could be refined for authorship attribution judging by the number of misses in the comments here, and how this compares to greater accuracy of the original post to which antirez refers. I’m no expert, but some ideas:
- remove super high frequency non specific words from the comparison bags, because they don’t distinguish much, have less semantic value and may skew the data
- remove stop words (NLP definition of stop words)
It wouldn't change anything though. Unless you delete your comment / account a few minutes after you post, it's gonna get scraped and saved into a DB almost instantly. After that, the fact that HN deleted them won't save you from this.
Managed to find an alt I forgot I made and gave up using years ago. I do wonder about other high up people. Like what about our mutual histories makes us have similar word usage? Are we from the same areas or did we hang out in similar places online?
> Please don't post unsubstantive comments to HN. [link to guidelines]
My guess is it was a parody/impersonator account.
You can enable "showdead" in your profile to see [dead] comments and posts. Most of them are crap, but there are some false positives and errors from time to time.
I like to leave dead comments on. It helps me feel better about living in a cultural wasteland to see that people who probably live near urban centers can be just as dumb! It really does help me worry less.
This is a bit tangential but I've noticed lots of comments aping the style of Matt Walsh. Not just on HN either, but probably more here than other places I visit.
Anyway, I guess this would be useful to cluster the "Matt Walsh"-y commenters together.
Have you tried to analyze whether there is a correlation between "closeness" according to this metric and how often users chat in the same thread? I recognize some usernames that are reported as being similar to me, I wonder if there's some kind of self-selection at play.
That makes me wonder two things.
Firstly, if you can use this to find LLM generated content, which I guess would need similar instructions. Imagine instructing it to talk like a pirate, it would be quite different from a generic response.
Secondly, if you want to make an alt account harder to cross-correlate with your main, would rewriting your comments with an LLM work against this method? And if so, how well?
don't site comment we here post that users against you're
Quite a stance, man :)
And me clearly inarticulate and less confident than some:
it may but that because or not and even these
I noticed that randomly remembered usernames tend to produce either lots of utility words like the above, or very few of them. Interestingly, it doesn't really correlate with my overall impression about them.
I think it would be interesting to run this tool against Reddit, 4chan and Twitter to find astroturf accounts. Does it look like a real browser to those sites or would it be blocked?
I remember the original post the author is referring to. I was captivated by it and thought it was cool. When I ran the original mentioned in the post, it detected one of my alts that I had forgotten about. OP's newer implementation using different methodologies did not detect the alt. For reference, the alt was created in 2010 and the last post was in 2012. Perhaps my writing style has changed?
I usually just create a new account every time I get a new computer or reinstall OS. I thought most of the results here were noise, but after closer inspection it just found 10 accounts I forgot having. Actually incredible and a little scary how well it works.
I noticed that in my top 20 similar users, the similarity rank/score/whatever are all >~0.83. However, randomly sampling from users in this thread, some top 20s are all <~0.75, or all roughly 0.8, etc.
Is there anything that can be inferred from that? Is my writing less unique, so ends up being more similar to more people?
Also, someone like tptacek has a top 20 with matches all >0.87. Would this be a side-effect of his prolific posting, so matches better with a lot more people?
It's not "less unique" as the structure of the sentence is what matters: the syntax. But you simply tend to use words with balanced frequency. It's not a bad thing.
Yeah, definitely not a bad thing. This just piqued my curiosity and is in a field I'm not super familiar with, so I'm just trying to wrap my head around it.
I got 3 correct matches out of 20, and I've had about 6 accounts total (using one at a time), with at least a fair number of comments in each. I guess that means that my word choices are more outliers than yours or there is just more to match. So it's not really good enough to reliably identify alt accounts, but it is quite suggestive.
Actually, the way that these things work is usually by focusing exclusively on the usage patterns of very common (top 500) words. You get better results by ignoring content words in favor of the linking words.
When they are rarely used (a small amount of total words produced), they don't have meaningful statistical info for a match, unfortunately. A few users here reported finding actual duplicated accounts they used in the past.
I've had several accounts over the last decade, but this wasn't able to find any of the old ones, even after expanding the results to 50 users. I personally chalk it up to my own writing style changing (intentionally and unintentionally) over the years.
It looks like I don't use the word "and" very often. I do notice that I tend to avoid concatenating sentences like that, although it is likely that there just isn't enough data on my account as I haven't been on HN for that long.
I wonder if such an analysis could tease apart the authors of intentionally anonymous publications. Things like peer review notes for papers or legal opinions (afaik in countries that are not the USA, the authors of a dissenting supreme court decision are not named).
Interesting, the top 3 similar accounts to me are two USers and an Australian. I'm Romanian (and living in Romania). I probably read too many books and news in English :)
Well, and worked a lot with americans over text based communication...
I think an interesting use of this is potentially finding LLMs trained to have the style of a person. Unfortunately now, just because a post has my style it doesn't mean it was me. I promise I am not a bot. Honest.
particularly multi-dimension scaling, but personally I think tSNE plots are less pathological (they don't have as many of these crazy cusps that make me think it's projecting down from a higher-dimensional surface which is near-parallel to the page)
After processing documents with BERT I really like the clusters generated by the simple and old k-Means algorithm
It has the problem that it always finds 20 clusters if you set k=20 and a cluster which really oughta be one big cluster might get treated as three little clusters but the clusters I get from it reflect the way I see things.
Redis supports random projection to a lower dimensionality, but the reality is that projecting a 350d vector into 2d is nice but does not remotely capture the "reality" of what is going on. But still, it is a nice idea to use sometime. However I would do that with more than 350 top words, since when I used 10k it strongly captured the interest more than the style, so a 2D projection of this is going to be much more interesting I believe.
Your observation is really acute: the small difference is due to quantization. When we search for element A, which is int8 quantized by default, the code path de-quantizes it, then re-quantizes it and searches. This produces a small loss of precision, like this:
redis-cli -3 VSIM hn_fingerprint ELE pg WITHSCORES | grep montrose
montrose 0.8640020787715912
redis-cli -3 VSIM hn_fingerprint ELE montrose WITHSCORES | grep pg
pg 0.8639097809791565
So while cosine similarity is commutative, the quantization steps lead to a slightly different result. But the difference is .000092, which in practical terms is not important. Redis can use non-quantized vectors via the NOQUANT option in VADD, but this makes the vector elements use 4 bytes per component: given that the recall difference is minimal, it is almost always not worth it.
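To make the effect concrete, here is a toy Python model of how a de-quantize / re-normalize / re-quantize round trip on the query side can make two "symmetric" scores differ slightly. This is only an illustration under assumed details (per-vector range scaling, normalization before quantization, the example vectors); it is not Redis's actual code path and will not reproduce the numbers above.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def quantize(v):
    # Assumed scheme: int8 values plus a per-vector range, x ~= q * rng / 127.
    rng = max(abs(x) for x in v)
    return [round(x / rng * 127) for x in v], rng

def dequantize(q, rng):
    return [x * rng / 127 for x in q]

def stored(v):
    # What the toy index keeps for an element: normalized, then quantized.
    return quantize(normalize(v))

def query_score(query_elem, target_elem):
    # Query side: de-quantize the stored element, re-normalize, re-quantize.
    q, q_rng = quantize(normalize(dequantize(*query_elem)))
    t, t_rng = target_elem
    dot = sum(x * y for x, y in zip(q, t))
    return dot * (q_rng / 127) * (t_rng / 127)

a = [0.9, 0.3, 0.2, 0.1]  # made-up example vectors
b = [0.7, 0.5, 0.1, 0.3]
score_ab = query_score(stored(a), stored(b))
score_ba = query_score(stored(b), stored(a))
```

In this toy model the exact cosine is identical in both directions, while `score_ab` and `score_ba` differ by a small amount because the re-normalized query gets a slightly different range than the stored copy.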
My guess is that people from the same region and similar background will have more and closer "alters". So, if you are Californian-American then there are many people who will write similarly to you on HN. If you are a Satawalese speaker then you may be quite alone in your own group.
(The Satawalese language has 460 speakers, most of whom live on Satawal Island in the Federated States of Micronesia.)
It's a fingerprinting tool, not a profiling tool. You can't draw such conclusions from it.
What a profiler would do to identify someone, I imagine, requires much more. Like the ability to recognize someone's tendency of playing the victim to leverage social advantage in awkward situations.
85% is surprisingly high for fingerprinting, hence self-deprecation over insulting the author by poking at efficacy. I wouldn't have expected my Australian spelling, Oxford comma, or cadence to be anything close to the Californian Rust enthusiasts I apparently match against. Especially as there's no normalization happening - so even the Burrows-Delta method shouldn't match my use of "gaol" or "humour" that often.
But, limiting to the top couple hundred words, probably does limit me to sounding like a pretentious dickhole, as I often use "however", "but", and "isn't". Corrections are a little too frequent in my post history.
I'd expect things might be a tiny bit looser with precisions if something small like stop words were removed. Though, it'd be interesting to do the opposite. If you were only measuring stopwords, would that show a unique cadence?
PhasmaFelis and mikeash have all matches mutual for the top 20, 30, 50 and 100. Are there other users like this? If so, how many? What's the significance of this, in terms of the shape of the graph?
tablespoon is close, but has a missing top 50 mutual (mikeash). In some ways, this is an artefact of the "20, 30, 50, 100" scale. Is there a way to describe the degree to which a user has this "I'm a relatively closer neighbour to them than they are to me" property? Can we make the metric space smaller (e.g. reduce the number of Euclidean dimensions) while preserving this property for the points that have it?
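One way to put a number on the "all matches mutual" property is the fraction of a user's top-k neighbours that also have that user in their own top-k. The Python sketch below uses made-up users and scores; the function names and the tie-breaking by name are assumptions for illustration.

```python
def symmetric(pairs):
    # pairs: {(u, v): score} -> nested dict with both directions filled in.
    sims = {}
    for (u, v), score in pairs.items():
        sims.setdefault(u, {})[v] = score
        sims.setdefault(v, {})[u] = score
    return sims

def top_k(user, sims, k):
    # Highest score first; ties broken by name so results are deterministic.
    return sorted(sims[user], key=lambda u: (-sims[user][u], u))[:k]

def mutual_fraction(user, sims, k):
    # Share of user's top-k neighbours that list user in their own top-k.
    mine = top_k(user, sims, k)
    if not mine:
        return 0.0
    mutual = [u for u in mine if user in top_k(u, sims, k)]
    return len(mutual) / len(mine)

# Hypothetical scores between four made-up users.
sims = symmetric({
    ("a", "b"): 0.90, ("a", "c"): 0.80, ("a", "d"): 0.50,
    ("b", "c"): 0.70, ("b", "d"): 0.60, ("c", "d"): 0.75,
})
```

A user like PhasmaFelis or mikeash would score 1.0 at every k checked, while a user with one missing mutual in the top 50 would score 49/50 at k=50.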
Thank you, tptacek. I was able to verify, thanks to the Internet Archive caching of "pg" for the post of 3 years ago, that the entries are quite similar in the case of "pg". Consider that it captures just the statistical patterns in very common words, so you are not likely to see users that you believe are "similar" to yourself. Notably: montrose may really be a secondary account of PG, and was also found as a cross reference in the original work of three years ago.
Also note that vector similarity ranking is not reciprocal: one item can have a top-scoring match, but that match may have many more items nearer to it, like in 2D space when you have a cluster of points and another point nearby but a bit apart.
Unfortunately I don't think this technique works very well for actual duplicated accounts discovery because often times people post just a few comments in fake accounts. So there is not enough data, if not for the exception where one consistently uses another account to cover their identity.
EDIT: at the end of the post I added the visual representations of pg and montrose.
If you want to do document similarity ranking in general it works to find nearby points in word frequency space but not as well as: (1) applying an autoencoder or another dimensional reduction technique to the vectors or (2) running a BERT-like model and pooling over the documents [1].
I worked on a search engine for patents that used the first, our evaluations showed it was much better than other patent search engines and we had no trouble selling it because customers could feel the difference in demos.
I tried dimensional reduction on the BERT vectors and in all cases I tried I found this made relevance worse. (BERT has learned a lot already which is being thrown away, there isn't more to learn from my particular documents)
I don't think either of these helps with the "finding articles authored by the same person" because one assumes the same person always uses the same words whereas documents about the topic use synonyms that will be turned up by (1) and (2). There is a big literature on the topic of determining authorship based on style
Indeed, but my problem is: all those vector databases (including Redis!) are always thought of as useful in the context of learned embeddings, BERT, Clip, ... But I really wanted to show that vectors are very useful and interesting outside that space. Now, I also like encoders very well, but I have the feeling that Vector Sets, as a data structure, need to be presented as a general tool. So I really cherry picked a use case that I liked and where neural networks were not present. Btw, Redis Vector Sets support dimensionality reduction by random projection natively in case the vector is too redundant. Yet, in my experiments, I found that using binary quantization (also supported) is a better way to save CPU/space compared to RP.
we have Dissociative Identity Disorder, I wonder if our different personalities would also have different fingerprints? we do have different writing styles
I discovered the data is available up to date. Maybe sooner or later I'll repeat and extend the analysis, potentially also using multiple ways to compute the vectors, including SBERT (or better SModernBERT).
I typically abandon an account once I reach 500 karma since it unlocks the ability to downvote. I'm now very self conscious about the words I overuse...
I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.
cocktailpeanuts and I for example, mutually share some words like:
I noted the "analyze" feature didn't seem as useful as it could be because the majority of the words are common articles and conjunctions.
I'd like to see a version of analyze that filters out at least the following stop words: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
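As a rough Python sketch, filtering the "analyze" output with exactly that list is a one-liner. The function name is hypothetical, and the example input is the most-common-words list another commenter reported in this thread.

```python
# The stop words proposed above, verbatim.
STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

def without_stop_words(ranked_words):
    # Keep the original ranking order, drop anything in the stop list.
    return [w for w in ranked_words if w.lower() not in STOP_WORDS]

reported = "they because then that but their the was them had".split()
filtered = without_stop_words(reported)
```

With that input only "because", "them" and "had" survive, which shows how much of the analyze output is made of stop words.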
The system uses on purpose those simple words, since they are "tellers" of the style of the user in a context independent way. Burrows papers explain this very well, but in general we want to capture low-level structure, more than topics and exact non obvious words used. I tested the system with 10k words and removing the most common words, and you get totally different results (still useful, but not style matching), basically you get users grouped by interests.
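For reference, Burrows' Delta as it is commonly described works like this: take the relative frequencies of the most common words, z-score each word across all authors, and use the mean absolute difference of z-scores as the distance between two authors. The Python sketch below follows that description; it is not necessarily what the tool computes, and the choice of population standard deviation is an assumption.

```python
import statistics
from collections import Counter

def rel_freqs(tokens, vocab):
    # Relative frequency of each vocabulary word in one author's tokens.
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in vocab]

def z_scores(profiles):
    # profiles: {author: [relative frequency per vocabulary word]}
    authors = list(profiles)
    columns = list(zip(*(profiles[a] for a in authors)))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    return {
        a: [(f - m) / s if s else 0.0 for f, m, s in zip(profiles[a], means, sds)]
        for a in authors
    }

def delta(za, zb):
    # Mean absolute difference of z-scores; lower means more similar style.
    return sum(abs(x - y) for x, y in zip(za, zb)) / len(za)
```

Because every word is z-scored, a very frequent word like "the" does not dominate a rarer one like "because", which is part of why only very common function words are needed.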
>The system uses on purpose those simple words, since they are "tellers" of the style of the user in a context independent way.
Yes, that's good! I didn't state my interest clearly, though.
I'd like to see the "analyze" result with the stop words excluded,
not for the style comparison part,
but for the reasons you state and others.
I think grouping users by interests would be a more interesting application. Most users don't have multiple accounts, but everyone probably shares some interests with other users, whom they might enjoy discovering.
Pretty sure the point here is to demonstrate how governments or other surveillance orgs can easily find your alt accounts even if you use Tor or any number of security tools.
The original version nailed all of my accounts with terrifying accuracy. Since then I make a new account every few days or weeks. Against the rules I know. And I’ve learned a lot about HN IP tracking and funny shadowbanning-like tricks they play but don't cop to. Like I get different error messages based on the different banned ips I use. And I see different behavior and inconsistency with flagged messages (like one that got upvoted a day after it was flagged and not visible to other users).
[0] https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1s...
My favorite which is also up to date is the ClickHouse playground.
For example:
https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUICogRl...
I subscribe to this issue to keep up with updates:
https://github.com/ClickHouse/ClickHouse/issues/29693#issuec...
And ofc, for those that don't know, the official API https://github.com/HackerNews/API
I didn't know there was an official API! This explains why the data is so readily available in many sources and formats. That's very cool.
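For anyone who wants to try the official API directly, here is a minimal Python sketch that fetches a single item by id. The helper names are made up; the base URL and the `type`, `by` and `deleted` fields are the ones described in the HackerNews/API README linked above. Pulling the whole history this way means one request per item, which is why the BigQuery and ClickHouse copies are more convenient for bulk analysis.

```python
import json
from urllib.request import urlopen

BASE = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id):
    # Items (stories, comments, polls, ...) all live under /item/<id>.json.
    return f"{BASE}/item/{item_id}.json"

def fetch_item(item_id, timeout=10.0):
    # Returns the decoded JSON item, or None if the id does not exist.
    with urlopen(item_url(item_id), timeout=timeout) as resp:
        return json.load(resp)

def is_comment(item):
    # True only for existing, non-deleted comments.
    return bool(item) and item.get("type") == "comment" and not item.get("deleted")
```

A typical use would be `item = fetch_item(some_id)` followed by `is_comment(item)` before reading `item["by"]` and `item["text"]`.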
With a more straightforward approach, the tool can be reproduced with just a few queries in ClickHouse.
1. Create a table with styles by authors:
2. Calculate and insert style vectors (the insert takes 27 seconds):
3. Find nearest authors (the query takes ~50 ms):…
i can’t believe i’ve been running a script to ingest the data for the last six hours. thank you.
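For readers without a ClickHouse instance, the idea behind the three steps can be sketched in plain Python. This is only an illustration under stated assumptions, not the queries from the comment: the regex tokenizer, `zlib.crc32` standing in for `cityHash64`, cosine similarity as the "nearest" measure, and the sample comments are all hypothetical. Only the 128-bucket vector size and the per-author summing come from the thread.

```python
import math
import re
import zlib

VEC_SIZE = 128  # bucket count, as in the ClickHouse example

def style_vector(comments):
    """Sum hashed word counts over all of one author's comments."""
    vec = [0] * VEC_SIZE
    for text in comments:
        for token in re.findall(r"[a-z']+", text.lower()):
            # crc32 is an assumption; the thread uses cityHash64 instead
            vec[zlib.crc32(token.encode("utf-8")) % VEC_SIZE] += 1
    return vec

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_authors(name, vectors, k=3):
    """Return up to k (score, author) pairs, most similar first."""
    target = vectors[name]
    scored = [(cosine(target, v), other)
              for other, v in vectors.items() if other != name]
    return sorted(scored, reverse=True)[:k]

# Hypothetical sample data, for illustration only
comments_by_author = {
    "alice": ["I think that this is the way it works.", "I think it is fine."],
    "alice_alt": ["I think that it is the way this works."],
    "bob": ["Rust compilers borrow checker lifetimes traits generics."],
}
vectors = {a: style_vector(c) for a, c in comments_by_author.items()}
print(nearest_authors("alice", vectors))
```

Hashing into a fixed number of buckets keeps every author's vector the same length without building a vocabulary first, at the cost of unrelated words occasionally sharing a bucket.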
It works for me. The accounts I used long time ago are there in high positions. I guess that my style is very distinctive.
But I also have seen some accounts that seem to be from other non-native English speakers. They may even have a Latin language as their native one (I just read some of their comments, and, at minimum, some of them seem to also be from the EU). So, I guess, that it is also grouping people by their native language other than English.
So, maybe, it is grouping many accounts by the shared bias of different native-languages. Probably, we make the same type of mistakes while using English.
My guess will be that native Indian or Chinese speakers accounts will also be grouped together, for the same reason. Even more so, as the language is more different to English and the bias probably stronger.
It would be cool if Australians, Brits, and Canadians tried the tool. My guess is that the probability of them finding alt accounts is higher, as the population is smaller and the writing more distinctive than that of Americans.
Thanks for sharing the projects. It is really interesting.
Also, do not trust the comments too much. There is an incentive to lie as to not acknowledge alt-accounts if they were created to remain hidden.
I discovered 2 people in my top 20 who I can bet are from the same country as me, and it is not a big country.
> Probably, we make the same type of mistakes while using English.
That is most likely the case. Case in point: My native language doesn't have articles, so locally they're a common source of mistakes in English.
It would be fun to have a tool try to guess your native language, based on your English writing.
I noticed that it also depends on the vendor of the autocorrect/dictionary you're using.
The project referenced in the post put me next to Brits on the similarity list and indeed I am using an English(UK) dictionary. Meanwhile this iteration aligns me with Americans despite the only change being the vendor (formerly Samsung, now Google).
I guess the Samsung keyboard corrects to proper Bri'ish.
I picked up the language as a child from a collection of people, half of whom weren't native speakers, so I don't speak any specific dialect.
Didn't catch my original account when I tried it, not anywhere in top 100.
But, if I do the reverse (search using my original account), this one shows up as #2.
The main difference between the accounts is this one has a lot more posts, and my original account was actively posting ~11 years ago.
I never knew A can be like B without B being like A.
The matching score is probably the same, or very close in both ways, but this fact does not necessarily help in a three-way scenario:
When you search for A the best match will be C, but if you start with B it will be A. If one of the accounts has a smaller sample set as in GP's case, the gap could be quite big.
I'm still in disbelief. I think one should run the operation in reverse after obtaining the result set.
The "analyze" feature works pretty well.
My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently that I would otherwise.
They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to")
My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.
In case anyone cares.
I would prefer the "analyze" feature focus on content rather than structure words. I forget the specific linguistic terms but to a first approximation, nouns and verbs would be of interest, prepositions and articles not. Let's call the former "syntactic" and the latter "semantic."
I suppose it's possible the "analyze"-reported proportions are a lot more precise and reliably diagnostic than I imagine. I haven't yet looked in detail at the statistical method.
Also, of course, it would require integration with NLP tooling such as WordNet (or whatever's SOTA there something like a decade and a half on) and a bit of Porter stemming to do part-of-speech tagging. If one 0.7GB dataset is heavyweight where this is running, that could be a nonstarter; stemming is trivial and I recall WordNet being acceptably fast if maybe memory hungry on a decade ago's kinda crappy laptop, but I could see it requiring some expensive materialization just to get datasets to inspect. (How exactly do we define "more common" for eg "smooth?" Versus semantic words, all words, both, or some combination? Do we need another dataset filtered to semantic words? Etc.)
If we're dreaming and I can also have a pony, then it would be neat to see both the current flavor, one focused on semantics as above, and one focused specifically on syntax as this one coincidentally often seems to act like. I would be tempted to offer an implementation, but I'm allergic to Python this decade.
Of course, immediately after the edit window closes, I revisit this comment and discover that in the first paragraph I swapped my terms and made a hash of the rest of the thing. Please cut out and paste into your printouts the following corrected version. Thank you!
> I would prefer the "analyze" feature focus on content rather than structure words. I forget the specific linguistic terms but to a first approximation, nouns and verbs would be of interest, prepositions and articles not. Let's call the former "semantic" and the latter "syntactic."
Should is a commonly used word and a fine one. You should feel free to use it. If someone gets hot under the collar because you said he should do something then he is an idiot.
"Ought to" is essentially a synonym. Anyone that gets upset when you said they should do something but is fine when you say that they ought to do something is truly a moron.
This isn't a habit of communication. I honestly mean it: if you get upset that someone said you "should" do something, but you are fine when they say you "ought to" do it, then you must be stupid. They mean the same thing in modern English.
Yes but words hold memories to others. Since 'ought to' is less frequently used it doesn't 'trigger' people the same way.
Most people are emotion-first, how the words make them feel is more important than the definitions of them. Being emotion-first doesn't make them stupid.
Being so emotional they react wildly to one of the most common words in the English language does in fact make them stupid.
The only time to avoid command words like should is when the person could conceivably see them as a command. Because then you're being a dick.
Otherwise, if someone wants to take the time to dissect meaning from add-on meaningless words like should in a sentence, they should find something better to do with their time. Or just ask instead of being a moron.
How are you being a dick?! There are loads of reasons why you may want or need to instruct someone to do something. I prefer the imperative mood. It is more direct. "Sudo make me a cup of tea".
Most people are more moronic than one might think
I (also?) felt the 'words used less often' were much easier to connect to as a conscious effort. I pointed chatgpt to the article and pasted in my results and asked it what it could surmise about my writing style based on that. It probably connected about as well as the average horoscope but was still pretty interesting!
That's very interesting as I noticed that certain outliers seemed indeed conscious attempts.
Since you seem to care about your writing, I'm wondering why you used "that" here?
> I use "this" less frequently that I would otherwise
Isn't it "less than" as opposed to "less that"?
Typo. Good catch
I think “should” and “ought to” end up being equivalent.
I prefer to avoid such absolutes and portray causality instead.
For example, in place of “you should not do drugs at work” I prefer “if you take drugs at work you’ll get in trouble”.
They do, and your suggestion is a great alternative. I'll try to do more of that
I often times go back to replace the instances of "should" with "could" as I could not tell people what to do. “you could not do drugs at work”
Nothing wrong with implying that people ought to behave according to mainstream social norms.
Isn’t that the same as saying that counterculture, fringe culture, and subcultures ought not exist?
I wouldn't take it that far, individual rights are important but so is functioning society. Counterculture is often a cyclic/youthful reaction to things that have been established as "better" or "proper" by older generations. Most rules for life such as the 10 Commandments have sort of a "deny yourself some pleasure because it's better for the group" vibe that have been learned and re-learned over many centuries.
Interestingly, when most people simply choose to do what most people choose to do, you get an emergent 'herd mentality' which can lead to some very strange places. It is also sensitive to very small perturbations - which in real terms means, the one person who does manage to think for themselves may find they have an outsized effect on the direction of the crowd.
I think this mentality is also where the term 'sheeple' comes from.
Now if you only underindex on "underindex"... There's a good alternative that everyone understands, "use less"
If we weren't on HN, I might be inclined to agree
> I prefer "ought to"
I too like when others use it, since a very easy and pretty universal retort against "you ought to..." is "No, I don't owe you anything".
Are you saying there's a connection between "ought" and "owe"? All I see is "I don't want to hear any criticism".
Yes, "ought" is the past tense of "owe". At some point, the second alternative spelling "owed" was introduced to better separate the two meanings (literal and figurative), but it's still the same word; a similar thing happened with "flower" and "flour", those used to be interchangeable spellings of the same word but then somebody decided that the two meanings of that word should be separated and given specific spellings.
And the construct "you owe it to <person> to <verb>" still exists even today but is not nearly as popular as "you should <verb>" precisely because it has to state to whom exactly you owe the duty; with "should" it sounds like an impersonal, quasi-objective statement of fact which suits the manipulative uses much better.
I have occasionally used the construct “you owe it to yourself to X”. I think it works well at conveying the sentiment that the person in question may be missing out on something if they don’t do X.
“You should” has a much more generic and less persuasive sentiment. “Why should I?” is a common and easy response which now leaves the suggester having to defend their suggestion to a skeptical audience.
Good point about "should" - it's also a word that has lost its original meaning. Shall, should, will and would used to have different, more nuanced meanings compared to how we tend to use them today.
The only place today I see "shall" used correctly where most would say "should" or "will," is in legal documents and signage.
The etymology makes a connection through old English. Oxford dictionary also contains this meaning:
> used to indicate duty or correctness
A duty to others is something you owe them; think, a duty of care and its lack, which is negligence.
> Again, my thought on good, interesting writing is that these are to be avoided.
You mean, ”I think this should be avoided”? ;)
Nice one high five
I wonder how much accuracy would be improved if expanding from single words to the most common pairs or n-tuples.
You would need more computation to hash, but I bet adding frequency of the top 50 word-pairs and top 20 most common 3-tuples would be a strong signal.
( Noting that the accuracy is already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote. )
Maybe there isn't enough data for each user for pairs, but I thought about mixing the two approaches (but had no time to do it), that is, to have 350 components like now, for the single word frequency, plus another 350 for the most common pairs frequency. In this way part of the vector would remain a high enough signal even for users with comparably less data.
I've been thinking some more about this, and it occurred to me that you'd want to encode sentence boundaries as a pseudo-word in the n-tuples.
I then realised that "[period] <word>" would likely dominate most common pairs, and that a lot of time could be saved by simply recording the first word of sentences as their own vector set, in addition but separate to the regular word vector.
Whether this would be a stronger or weaker signal per-vector-space than the tail of words in the regular common-words vector I don't know.
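The two ideas above (word pairs with a sentence-boundary pseudo-word, and a separate count of sentence-opening words) can be sketched roughly as follows. The `<s>` marker, the regex-based sentence splitter, and the sample text are hypothetical choices for illustration; nothing here is taken from the tool itself.

```python
import re
from collections import Counter

BOUNDARY = "<s>"  # hypothetical pseudo-word marking the start of a sentence

def sentences(text):
    """Split on ., ! or ? and return each sentence as a list of lowercase words."""
    parts = re.split(r"[.!?]+", text.lower())
    return [re.findall(r"[a-z']+", p) for p in parts if p.strip()]

def pair_counts(text):
    """Count adjacent word pairs, with the boundary marker before each sentence."""
    counts = Counter()
    for words in sentences(text):
        seq = [BOUNDARY] + words
        counts.update(zip(seq, seq[1:]))
    return counts

def first_word_counts(text):
    """Count only the word that opens each sentence."""
    return Counter(words[0] for words in sentences(text) if words)

sample = "I think so. I think not. Maybe it works."
pairs = pair_counts(sample)
firsts = first_word_counts(sample)
print(pairs.most_common(3))
print(firsts.most_common())
```

Every pair that starts with the boundary marker carries the same information as one entry in the first-word counter, which is the redundancy the comment points out.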
Maybe I talk too much on HN. :)
When I ran it, it gave me 20 random users, but when I do the analyze, it says my most common words are [they because then that but their the was them had], which is basically just the most common English words.
Probably would be good to exclude those most common words.
I had a similar result. 85%+ similar to a bunch of random accounts, and my perhaps most distinguishing feature is I don't use the word 'app' or 'company' a lot. The former because I dislike the word, and the latter maybe because I'm self-employed.
I figured it maybe would cluster me with other non-native speakers but it doesn't appear to. Of all the accounts where I could identify a country of origin, all were American.
Funnily enough, my top 10 words used less often are as follows:
you, are, have, they, at, an, we, if, do, to
I'm frankly not quite sure how I've avoided them given how common they are.
For visibility, here's the tool where you can enter your hn username:
https://antirez.com/hnstyle?username=pg&threshold=20&action=...
This is a great example of what's possible and how true anonymity, even online, is only "technological threshold" anonymity. People obsessed with biometrics might not consider that this is another biometric.
Instead of just HN, now do it with the whole internet, imagine what you'd find. Then imagine that it's not being done already.
None of my throwaways and not even my old account shows up. We are not at that level yet. ymmv.
This technique yields so many false positives and negatives, it's practically useless. Possibly it works reliably for mono-lingual, prolific writers. Someone like the Qanon shaman (or whatever the name was) might be picked up, if it doesn't happen to be a collective.
Except that technology is on the side of anonymity this time. LLMs can provide a pretty solid defense against such attacks — just ask ChatGPT to rewrite your message in a random writer's style. The issue is that you'll end up sounding like an LLM, but hey, tradeoffs.
Using throwaways whenever possible mitigates a lot of the risk, too.
That’s true. The old security versus convenience hack.
But if i were a government agency I would be pressing AI providers for data, or fingerprinting the output with punctuation/whitespace or something more subtle.
Tho i guess with open models that people can run on device that’s mitigated a lot.
It did find my "alt" (really an old account with a lost password), but the rest of the list – all users with very high match scores (0.8+) – is random.
Taking a look at comments from those users, I think the issue is that the algorithm focuses too much on the topic of discussion rather than style. If you are often in conversations about LLMs or Musk or self driving cars then you will inevitably end up using a lot of similar words as others in the same discussions. There's only so many unique words you can use when talking about a technical topic.
I see in your post that you try to mitigate this by reducing the number of words compared, but I don't think that is enough to do the job.
In case you haven't seen it, the author addressed this point of topic vs. style in a comment (albeit in a different context): https://news.ycombinator.com/item?id=43708474.
It did find an old account of mine that got banned, top of the list. I have to say 500 words for fingerprinting, that's mindblowing.
It focuses on topic a lot, that's true.
I wonder how much curly quote usage influences things. I type things like curly quotes with my Compose key, and so do most of my top similars; and four or five words with straight quotes show up among the bottom ten in our analyses. (Also etc, because I like to write &c.)
I’m not going to try comparing it with normalising apostrophes, but I’d be interested how much of a difference it made. It could easily be just that the sorts of people who choose to write in curly quotes are more likely to choose words carefully and thus end up more similar.
Curly vs. straight quotes is mainly a mobile vs. desktop thing AFAIK. Not sure what Mac does by default, but Windows and Linux users almost exclusively use plain straight quotes everywhere.
My impression is that iOS is the only major platform to even support automatically curlify quotation marks. Maybe some Android keyboards are more sensible about it, but none that I’ve used make it anything but manual.
We can improve this. antirez has made a highly compelling poc but it could be refined for authorship attribution judging by the number of misses in the comments here, and how this compares to greater accuracy of the original post to which antirez refers. I’m no expert, but some ideas:
- remove super high frequency non specific words from the comparison bags, because they don’t distinguish much, have less semantic value and may skew the data
- remove stop words (NLP definition of stop words)
- perform stemming/tokenization/depluralization etc (again, NLP standard)
- implement commutativity and transitivity in the similarity function
- consider words as hyperlinks to the sets of people who use them often enough, and do something Pageranky to refine similarity
- consider word bigrams, etc
- weight variations and misspellings higher as distinguishing signals
What are your ideas ?
This is exactly why HN needs to allow us to delete accounts.
It wouldn't change anything though. Unless you delete your comment / account a few minutes after you post, it's gonna get scraped and saved into a DB almost instantly. After that, the fact that HN deleted them won't save you from this.
Managed to find an alt I forgot I made and gave up using years ago. I do wonder about other high up people. Like what about our mutual histories makes us have similar word usage? Are we from the same areas or did we hang out in similar places online?
How does it find the high similarity between "dang" and "dangg" when the "dangg" account has no activity (like comments) at all?
https://antirez.com/hnstyle?username=dang&threshold=20&actio...
Probably it used to have when the database was created. Then the comments got removed.
It has a [dead] comment that says
> Please don't post unsubstantive comments to HN. [link to guidelines]
My guess is it was a parody/impersonator account.
You can enable "showdead" in your profile to see [dead] comments and posts. Most of them are crap, but there are some false positives and errors from time to time.
I like to leave dead comments on. It helps me feel better about living in a cultural wasteland to see that people who probably live near urban centers can be just as dumb! It really does help me worry less.
> Most of them are crap, but there are some false positives and errors from time to time.
HN silently black holes any comment made through a VPN, so I would expect a decent amount of false positives.
There is one dead comment by that user that mimics a "dang" comment. https://news.ycombinator.com/item?id=16800379 ("Please don't post unsubstantive comments to HN.")
This is a bit tangential but I've noticed lots of comments aping the style of Matt Walsh. Not just on HN either, but probably more here than other places I visit.
Anyway, I guess this would be useful to cluster the "Matt Walsh"-y commenters together.
Matt Walsh? I mean, sure, maybe he’s your guy. Or maybe he’s… not. Matt Levine, though, that’s the style to ape.
Have you tried to analyze whether there is a correlation between "closeness" according to this metric and how often users chat in the same thread? I recognize some usernames that are reported as being similar to me, I wonder if there's some kind of self-selection at play.
Maybe we like comments written closer to our style more, and that motivates us to respond to them.
That makes me wonder two things. Firstly, if you can use this to find LLM generated content, which I guess would need similar instructions. Imagine instructing it to talk like a pirate, it would be quite different from a generic response.
Secondly, if you want to make an alt account harder to cross-correlate with your main, would rewriting your comments with an LLM work against this method? And if so, how well?
Very cool. Also a bit surprising — two of my matches are people I know IRL.
Are you all from the same town? Another user reported this finding.
We had all met in the same city but weren't originally from there or live there any longer.
Maybe some "like attracts like" phenomena
Very interesting phenomenon. I feel like the term 'phenomenon' is too unsubstantial for something like this.
Dang's analysis was funny:
don't site comment we here post that users against you're
Quite a stance, man :)
And me clearly inarticulate and less confident than some:
it may but that because or not and even these
I noticed that randomly remembered usernames tend to produce either lots of utility words like the above, or very few of them. Interestingly, it doesn't really correlate with my overall impression about them.
I think it would be interesting to run this tool against Reddit, 4chan and Tweeter to find astroturf accounts. Does it look like a real browser to those sites or would it be blocked?
I remember the original post the author is referring to. I was captivated by it and thought it was cool. When I ran the original mentioned in the post, it detected one of my alts that I forgot about. OP's newer implementation using different methodologies did not detect the alt. For reference, the alt was created in 2010 and the last post was in 2012. Perhaps my writing style has changed?
I usually just create a new account every time I get a new computer or reinstall OS. I thought most of the results here were noise, but after closer inspection it just found 10 accounts I forgot having. Actually incredible and a little scary how well it works.
I noticed that in my top 20 similar users, the similarity rank/score/whatever are all >~0.83. However, randomly sampling from users in this thread, some top 20s are all <~0.75, or all roughly 0.8, etc.
Is there anything that can be inferred from that? Is my writing less unique, so ends up being more similar to more people?
Also, someone like tptacek has a top 20 with matches all >0.87. Would this be a side-effect of his prolific posting, so matches better with a lot more people?
It's not "less unique" as the structure of the sentence is what matters: the syntax. But you simply tend to use words with balanced frequency. It's not a bad thing.
Yeah, definitely not a bad thing. This just piqued my curiosity and is in a field I'm not super familiar with, so I'm just trying to wrap my head around it.
Thanks for the interesting tool!
I tried my name, and I don't think a single "match" is any of my (very rarely used) throw away alts ;) I guess I have a few people I talk like?
I got 3 correct matches out of 20, and I've had about 6 accounts total (using one at a time), with at least a fair number of comments in each. I guess that means that my word choices are more outliers than yours or there is just more to match. So it's not really good enough to reliably identify alt accounts, but it is quite suggestive.
I think if you rule out insanely common words, it might get scary accurate.
Actually, the way that these things work is usually by focusing exclusively on the usage patterns of very common (top 500) words. You get better results by ignoring content words in favor of the linking words.
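As a rough illustration of that approach, one can pick the N most common words across the whole corpus and describe each author only by how often they use those words. The tokenizer, the function names, and the two-sentence corpus below are hypothetical; the thread mentions sizes such as 350 or 500 words, while the toy example uses 2.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(all_texts, n):
    """Return the n most frequent words across every text in the corpus."""
    counts = Counter()
    for text in all_texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]

def frequency_vector(text, vocabulary):
    """Relative frequency of each vocabulary word within one author's text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[word] / total for word in vocabulary]

# Hypothetical toy corpus
corpus = ["the cat and the dog", "the kernel and the scheduler"]
vocabulary = top_words(corpus, 2)
vector = frequency_vector(corpus[1], vocabulary)
print(vocabulary, vector)
```

Content words such as "kernel" or "scheduler" never enter the vector, so two authors who discuss different topics can still land close together if they use the linking words in similar proportions.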
Interesting, I think it also doesn't help that outside of a throw away on a blue moon, I don't really use alts...
When they are rarely used (a small amount of total words produced), they don't have meaningful statistical info for a match, unfortunately. A few users here reported finding actual duplicated accounts they used in the past.
I've had several accounts over the last decade, but this wasn't able to find any of the old ones, even after expanding the results to 50 users. I personally chalk it up to my own writing style changing (intentionally and unintentionally) over the years.
No matches higher than .7something and no mutual matches let's go boys I'm a special unique snowflake
Such a nice scientific way to detect and mute those who go against the agenda's grain, oh I mean don't contribute anything meaningful to the community
It looks like I don't use the word "and" very often. I do notice that I tend to avoid concatenating sentences like that, although it is likely that there just isn't enough data on my account as I haven't been on HN for that long.
There are some interesting similarities in o.g. accounts aaronsw, pg, and jedberg.
I wonder if such an analysis could tease apart the authors of intentionally anonymous publications. Things like peer review notes for papers or legal opinions (afaik in countries that are not the USA, the authors of a dissenting supreme court decision are not named).
This is so cool. The user who talks most like me, and I can confirm he does, is ajb257
Interesting, the top 3 similar accounts to me are two USers and an Australian. I'm Romanian (and living in Romania). I probably read too many books and news in English :)
Well, and worked a lot with americans over text based communication...
I think an interesting use of this is potentially finding LLMs trained to have the style of a person. Unfortunately now, just because a post has my style it doesn't mean it was me. I promise I am not a bot. Honest.
I knew that this was possible but I always thought it took much more... effort? How do we mitigate this, then? Run our posts through an LLM?
I rotate hn accounts every year or two. In my top 4, I found 3 old alts.
This is impressive and scary. Obviously I had to create a throwaway to say this.
I'm surprised no one has made this yet with a clustered visualization.
Personally I like this approach a lot
https://scikit-learn.org/stable/modules/generated/sklearn.ma...
I think other methods are more fashionable today
https://scikit-learn.org/stable/modules/manifold.html
particularly multi-dimension scaling, but personally I think tSNE plots are less pathological (they don't have as many of these crazy cusps that make me think it's projecting down from a higher-dimensional surface which is near-parallel to the page)
After processing documents with BERT I really like the clusters generated by the simple and old k-Means algorithm
https://scikit-learn.org/stable/modules/generated/sklearn.cl...
It has the problem that it always finds 20 clusters if you set k=20 and a cluster which really oughta be one big cluster might get treated as three little clusters but the clusters I get from it reflect the way I see things.
Redis supports random projection to a lower dimensionality, but the reality is that projecting a 350d vector into 2d is nice but does not remotely capture the "reality" of what is going on. But still, it is a nice idea to try some time. However I would do that with more than 350 top words, since when I used 10k it strongly captured the interest more than the style, so a 2D projection of this is going to be much more interesting, I believe.
Given that some matches are “mutual” and others are not, I don’t see how that could translate to a symmetric distance measure.
Imagine the 2D space, it also has the same property!
You have three points nearby, and a fourth a bit more distant. 4 best match is 1, but 1 best match is 2 and 3.
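The 2D picture described here is easy to check numerically. In this small sketch the coordinates are made up to match the description (three points close together, a fourth further away); the distance itself is symmetric, yet "is the best match of" is not.

```python
import math

# Hypothetical coordinates: points 1, 2, 3 form a tight cluster, 4 sits apart
points = {1: (0.0, 0.0), 2: (0.5, 0.0), 3: (0.0, 0.5), 4: (-2.0, 0.0)}

def distance(a, b):
    """Euclidean distance between two labelled points."""
    return math.dist(points[a], points[b])

def nearest(p):
    """Label of the closest other point (ties go to the lower label)."""
    return min((q for q in points if q != p), key=lambda q: distance(p, q))

def is_mutual(a, b):
    """True when a and b are each other's best match."""
    return nearest(a) == b and nearest(b) == a

print(nearest(4), nearest(1))          # best match of 4, best match of 1
print(is_mutual(4, 1), is_mutual(1, 2))
```

Point 4 picks point 1 as its best match, while point 1 picks one of its cluster neighbours, so the match is one-directional even though `distance(4, 1)` equals `distance(1, 4)`.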
Good point, but the similarity score between mutual matches is still different, so it doesn’t seem to be a symmetric measure?
Your observation is really acute: the small difference is due to quantization. When we search for element A, which is int8 quantized by default, the code path de-quantizes it, then re-quantizes it and searches. This produces a small loss of precision, like this:
redis-cli -3 VSIM hn_fingerprint ELE pg WITHSCORES | grep montrose
montrose 0.8640020787715912
redis-cli -3 VSIM hn_fingerprint ELE montrose WITHSCORES | grep pg
pg 0.8639097809791565
So while cosine similarity is commutative, the quantization steps lead to a slightly different result. But the difference is .000092, which in practical terms is not important. Redis can use non-quantized vectors via the NOQUANT option in VADD, but this makes the vector elements use 4 bytes per component: given that the recall difference is minimal, it is almost always not worth it.
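To get a feel for the size of this effect, here is a small sketch of symmetric int8 quantization and its impact on cosine similarity. It is not Redis's actual code path: the scaling scheme, the function names, and the sample vectors are assumptions chosen for illustration. Only the two scores at the end are taken from the output above.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def quantize(vec):
    """Assumed scheme: scale so the largest magnitude maps to 127, then round."""
    scale = max(abs(x) for x in vec) / 127 or 1.0
    return [round(x / scale) for x in vec], scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return [v * scale for v in q]

# Hypothetical vectors
a = [0.12, 0.5, 0.33, 0.9]
b = [0.1, 0.55, 0.3, 0.8]

exact = cosine(a, b)
approx = cosine(dequantize(*quantize(a)), dequantize(*quantize(b)))
print(exact, approx, abs(exact - approx))

# Gap between the two scores reported in the thread
reported_gap = 0.8640020787715912 - 0.8639097809791565
print(reported_gap)
```

Rounding each component to one of 255 levels perturbs every vector slightly, so any extra quantize/de-quantize round trip on one side of the comparison can shift the score in the low decimal places.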
Used More Often by dang.
don't +0.9339
Cool tool. It's a shame I don't have other accounts to test it.
It's also a tool for wannabe impersonators to hone their writing style mimic skills!
I don't have other accounts, but still matched at 85+% accuracy for a half dozen accounts. Seems I don't have very original thoughts or writing style.
My guess is that people from the same region and similar background will have more and closer "alters". So, if you are Californian-American then there are many people on HN who will write similarly to you. If you are a Satawalese speaker then you may be quite alone in your own group.
(The Satawalese language has 460 speakers, most of whom live on Satawal Island in the Federated States of Micronesia.)
You couldn't have just picked a European country, you had to flex on us with Satawalese? :)
It's a fingerprinting tool, not a profiling tool. You can't draw such conclusions from it.
What a profiler would do to identify someone, I imagine, requires much more. Like the ability to recognize someone's tendency of playing the victim to leverage social advantage in awkward situations.
85% is surprisingly high for fingerprinting, hence self-deprecation over insulting the author by poking at efficacy. I wouldn't have expected my Australian spelling, Oxford comma, or cadence to be anything close to the Californian Rust enthusiasts I apparently match against. Especially as there's no normalization happening - so even the Burrows-Delta method shouldn't match my use of "gaol" or "humour" that often.
But, limiting to the top couple hundred words, probably does limit me to sounding like a pretentious dickhole, as I often use "however", "but", and "isn't". Corrections are a little too frequent in my post history.
I'd expect things might be a tiny bit looser with precisions if something small like stop words were removed. Though, it'd be interesting to do the opposite. If you were only measuring stopwords, would that show a unique cadence?
I don't know dude, don't take it personally.
PhasmaFelis and mikeash have all matches mutual for the top 20, 30, 50 and 100. Are there other users like this? If so, how many? What's the significance of this, in terms of the shape of the graph?
tablespoon is close, but has a missing top 50 mutual (mikeash). In some ways, this is an artefact of the "20, 30, 50, 100" scale. Is there a way to describe the degree to which a user has this "I'm a relatively closer neighbour to them than they are to me" property? Can we make the metric space smaller (e.g. reduce the number of Euclidean dimensions) while preserving this property for the points that have it?
Searched my nearest neighbor and found someone who agrees with my political views.
This is an interesting and well-written post but the data in the app seems pretty much random.
Thank you, tptacek. I was able to verify, thanks to the Internet Archive caching of "pg" from the post of 3 years ago, that the entries are quite similar in the case of "pg". Consider that it captures just the statistical patterns in very common words, so you are not likely to see users that you believe are "similar" to yourself. Notably: montrose may well really be a secondary account of PG, and was also found as a cross reference in the original work of three years ago.
Also note that vector similarity is not reciprocal: one item can have a top-scoring match, but that match may itself have many more items nearer to it, as in 2D space when you have a cluster of points and another point nearby but a bit apart.
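The non-reciprocal behavior described above can be sketched in a few lines of Python. This is only an illustration with made-up 2D points and Euclidean distance; the point names and the helpers `nearest` and `rank_of` are hypothetical and are not part of the tool discussed in the thread.

```python
import math

# Hypothetical 2D points: a tight cluster (a, b, c) and one point (d)
# that sits near the cluster but a bit apart from it.
points = {
    "a": (0.0, 0.0),
    "b": (0.1, 0.0),
    "c": (0.0, 0.1),
    "d": (1.0, 0.0),
}

def ranked_neighbors(name):
    """Return all other points sorted from closest to farthest."""
    return sorted(
        (other for other in points if other != name),
        key=lambda other: math.dist(points[name], points[other]),
    )

def nearest(name):
    """Return the name of the single closest other point."""
    return ranked_neighbors(name)[0]

def rank_of(target, name):
    """Return the 1-based position of `target` in `name`'s neighbor list."""
    return ranked_neighbors(name).index(target) + 1

# d's best match is b, but b has closer neighbors inside the cluster,
# so d does not come first in b's own ranking.
print(nearest("d"), rank_of("d", "b"))
```

Comparing `rank_of(x, y)` with `rank_of(y, x)` for a pair of users gives one simple way to see how lopsided a match is.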
Unfortunately I don't think this technique works very well for actual duplicate account discovery, because people often post just a few comments from fake accounts. So there is not enough data, except in the case where someone consistently uses another account to cover their identity.
EDIT: at the end of the post I added the visual representations of pg and montrose.
If you want to do document similarity ranking in general it works to find nearby points in word frequency space but not as well as: (1) applying an autoencoder or another dimensional reduction technique to the vectors or (2) running a BERT-like model and pooling over the documents [1].
I worked on a search engine for patents that used the first; our evaluations showed it was much better than other patent search engines, and we had no trouble selling it because customers could feel the difference in demos.
I tried dimensional reduction on the BERT vectors and in all cases I tried I found this made relevance worse. (BERT has learned a lot already which is being thrown away, there isn't more to learn from my particular documents)
I don't think either of these helps with the "finding articles authored by the same person" problem, because one assumes the same person always uses the same words, whereas documents about the same topic use synonyms that will be turned up by (1) and (2). There is a big literature on the topic of determining authorship based on style:
https://en.wikipedia.org/wiki/Stylometry
[1] With https://sbert.net/ this is so easy.
Indeed, but my problem is: all those vector databases (including Redis!) are always thought of as useful in the context of learned embeddings, BERT, CLIP, ... But I really wanted to show that vectors are very useful and interesting outside that space. Now, I also like encoders very much, but I have the feeling that Vector Sets, as a data structure, need to be presented as a general tool. So I really cherry-picked a use case that I liked and where neural networks were not present. Btw, Redis Vector Sets support dimensionality reduction by random projection natively, in case the vector is too redundant. Yet, in my experiments, I found that using binary quantization (also supported) is a better way to save CPU/space compared to RP.
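For readers unfamiliar with the two compression ideas mentioned above, here is a minimal Python sketch of random projection and binary quantization. It is not Redis code and does not use any Redis API; the function names and the sign-bit definition of quantization are assumptions chosen for illustration only.

```python
import random

def random_projection(vec, out_dim, seed=0):
    """Reduce `vec` to `out_dim` dimensions using a Gaussian random matrix.

    The same seed yields the same matrix for every vector of equal length,
    so projected vectors remain comparable with each other.
    """
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) * x for x in vec)
        for _ in range(out_dim)
    ]

def binary_quantize(vec):
    """Keep only the sign of each component: one bit per dimension."""
    return [1 if x > 0 else 0 for x in vec]

def hamming_similarity(bits_a, bits_b):
    """Fraction of positions where two bit vectors agree."""
    matches = sum(x == y for x, y in zip(bits_a, bits_b))
    return matches / len(bits_a)

# Hypothetical style vectors (for example, z-scored word frequencies).
u = [0.5, -1.0, 2.0, 0.0]
v = [0.4, 0.3, 1.5, -0.2]

print(random_projection(u, 2))
print(hamming_similarity(binary_quantize(u), binary_quantize(v)))
```

Random projection keeps floating point values but fewer of them, while binary quantization keeps every dimension at one bit each, so comparisons can be done with cheap bit operations.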
Did not find any of the alt accounts I've used since 2007. Which is good.
we have Dissociative Identity Disorder, I wonder if our different personalities would also have different fingerprints? we do have different writing styles
[flagged]
[flagged]
[flagged]
You've crossed into personal attack here. We ban accounts that do that, and you've been doing it in other threads too—for example:
https://news.ycombinator.com/item?id=43662951
https://news.ycombinator.com/item?id=43662889
If you keep this up, we're going to have to ban you again.
If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.
I wonder if it could help improve my karma? 8-/
My highest match was ChatGPT. Oh well
Edit: ChatGTP, my bad
so the website processes only comments older than 2023?
not very useful for newer users like me :/
I discovered the data is available up to date. Maybe sooner or later I'll repeat and extend the analysis, potentially also using multiple ways to compute the vectors, including SBERT (or better SModernBERT).
I don't mind revealing my alts since none of them seem to link back to my main. But the top 4 results were all correct for me:
https://antirez.com/hnstyle?username=gfd&threshold=20&action...
zawerf (Similarity: 0.7379)
ghj (Similarity: 0.7207)
fyp (Similarity: 0.7197)
uyt (Similarity: 0.7052)
I typically abandon an account once I reach 500 karma since it unlocks the ability to downvote. I'm now very self-conscious about the words I overuse...
It does seem like burner accounts are effective if each account is not overused.
this got two accounts that I used to use
Great! Thanks for the ACK.
fun project! but it didn't get any of my alts.
[dead]
Well, well, well, cocktailpeanuts. :spiderman_pointing:
I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.
cocktailpeanuts and I for example, mutually share some words like:
because, people, you're, don't, they're, software, that, but, you, want
Unfortunately, this is a forum where people will use words like "because, people, and software."
Because, well, people here talk about software.
<=^)
Edit: Neat work, nonetheless.
I noted the "analyze" feature didn't seem as useful as it could be because the majority of the words are common articles and conjunctions. I'd like to see a version of analyze that filters out at least the following stop words: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
The system uses on purpose those simple words, since they are "tellers" of the style of the user in a context independent way. Burrows' papers explain this very well, but in general we want to capture low-level structure more than topics and the exact non-obvious words used. I tested the system with 10k words and with the most common words removed, and you get totally different results (still useful, but not style matching): basically you get users grouped by interests.
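As a rough illustration of why common words carry style, here is a minimal Python sketch of the textbook form of Burrows' Delta: take the most frequent words overall, z-score each word's relative frequency across authors, and average the absolute differences. It is not the code behind the tool discussed here; the sample texts, the author names, and the `burrows_delta` helper are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word in `text`."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in vocab]

def burrows_delta(texts, top_n):
    """Build a delta(a, b) function over the authors in `texts`."""
    overall = Counter()
    for text in texts.values():
        overall.update(text.lower().split())
    vocab = [w for w, _ in overall.most_common(top_n)]
    freqs = {a: relative_freqs(t, vocab) for a, t in texts.items()}

    # z-score every word column across authors so that very frequent
    # words do not dominate the comparison.
    z = {a: [] for a in texts}
    for i in range(len(vocab)):
        column = [freqs[a][i] for a in texts]
        mu, sigma = mean(column), pstdev(column)
        for a in texts:
            z[a].append((freqs[a][i] - mu) / sigma if sigma else 0.0)

    def delta(a, b):
        """Mean absolute z-score difference; lower means more similar."""
        return mean(abs(x - y) for x, y in zip(z[a], z[b]))

    return vocab, delta

# Toy corpus: two accounts share function words but not topics.
texts = {
    "alice": "the cat and the dog and the bird",
    "alice_alt": "the fox and the owl and the hen",
    "bob": "but a cat but a dog but a bird",
}
vocab, delta = burrows_delta(texts, top_n=4)
print(vocab)
print(delta("alice", "alice_alt"), delta("alice", "bob"))
```

In this toy corpus "alice" and "alice_alt" write about different animals yet use "the" and "and" at the same rate, so they end up closer to each other than either is to "bob", who shares the topic words but not the function words.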
>The system uses on purpose those simple words, since they are "tellers" of the style of the user in a context independent way.
Yes, that's good! I didn't state my interest clearly, though. I'd like to see the "analyze" result with the stop words excluded, not for the style comparison part, but for the reasons you state and others.
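A stop-word-filtered view of the kind requested here could be sketched as follows in Python. The stop list is the one given earlier in this thread; the `top_content_words` helper and the simple punctuation stripping are assumptions for illustration, and nothing here reflects how the site's "analyze" feature is actually implemented.

```python
from collections import Counter

# Stop words as listed earlier in the thread.
STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or "
    "such that the their then there these they this to was will with".split()
)

def top_content_words(text, n=10):
    """Return the `n` most frequent words that are not stop words."""
    # Lowercase and strip surrounding punctuation from each token.
    words = [w.strip(".,!?\"'()").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    return counts.most_common(n)

sample = "The software is for the people, and the people want software that works."
print(top_content_words(sample, 2))
```

This only changes what is displayed; the style comparison itself would still use the common words.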
I think grouping users by interests would be a more interesting application. Most users don't have multiple accounts, but everyone probably shares some interests with other users, whom they might enjoy discovering.
Pretty sure the point here is to demonstrate how governments or other surveillance orgs can easily find your alt accounts even if you use Tor or any number of security tools.
There's already plenty of those running everywhere.
That seems to be a misconception.
The usage frequency of simple words is a powerful tell.
I can understand the nuance of your assertion, but looking at the data returned by these results suggests it's not really all that powerful at all.
There are so many people that write like me apparently, that simple language seems more like a way to mask yourself in a crowd.
You can definitely mask writing style. Whether you can do that only by using simple words, I am not so sure.
Indeed, some writing styles make frequent use of words like "that" and "just".
The original version nailed all of my accounts with terrifying accuracy. Since then I make a new account every few days or weeks. Against the rules, I know. And I’ve learned a lot about HN IP tracking and the funny shadowbanning-like tricks they play but don't cop to. Like I get different error messages based on the different banned IPs I use. And I see different behavior and inconsistency with flagged messages (like one that got upvoted a day after it was flagged and not visible to other users).
What you're doing makes HN worse, unfortunately.
Care to write it all up somewhere? Sounds interesting