[Enhancement]: Implement hashing for Mii Images #144
Comments
Updated your issue to remove the compression part, as that's now been added as its own issue #147
Trying to think of the best way to do this. Ideally the way this would work is by not hashing the entire Mii contents, but only the sections that affect appearance. Mii data has sections like device origin, names, etc. which aren't relevant to the output image. The official servers have incredibly small hashes. One hash from my dumps is
I have no idea how they got the hash so small, though? Nothing I've tried has gotten anywhere close to as small as the originals. According to https://www.3dbrew.org/wiki/Mii there are between 46 and 48 possible sections of the Mii data which are relevant here (I say 46-48 because there are 2 which might change the pants color, but I forgot which, if either, do that):
Given that these are all just numbers (treating the booleans as 0 or 1), we could just not "hash" at all and store these, in order, as a hex string. But that would result in a 96 character hash, which is huge compared to the official hashes.

We could try to pack them back into bit fields for more efficiency, like the original data (or just extract that out to begin with), but that would still leave us with a hash that's around the size of the Mii data itself, which is again huge compared to the official hashes.

We could pack the bits into bytes and ignore byte alignments, which is better in terms of size, but still nowhere near the size of the official hashes, and the size of the hash would vary depending on the input, whereas the official hashes are always the same length (if all fields are at their max sizes, the hash is 24 bytes).

We can lower the size using something like base62 (since the official hashes seem to only be alphanumeric), and in our worst-case scenario with ignored byte alignments we get a 33 character hash:

```js
const base62 = require('@fry/base62');

// Mii appearance values, in the order listed previously
const values = [
    1, 1, 1, 11, 127, 127, 11, 6, 11, 11, 131, 7, 1, 59, 5, 7, 6, 7, 12, 18, 24, 7, 8, 6, 11, 12, 18, 17, 8, 18, 35, 4, 8, 6, 18, 5, 5, 7, 8, 16, 8, 5, 7, 20, 1, 8, 16, 30
];

// Pack a string of '0'/'1' characters into a Buffer, zero-padding up to a whole byte
function bitStringToBuffer(bitString) {
    const remainder = bitString.length % 8;

    if (remainder !== 0) {
        bitString = bitString.padEnd(bitString.length + (8 - remainder), '0');
    }

    const buffer = Buffer.alloc(bitString.length / 8);

    for (let i = 0; i < bitString.length; i += 8) {
        const byteString = bitString.slice(i, i + 8);
        buffer[i / 8] = parseInt(byteString, 2);
    }

    return buffer;
}

let bits = '';

for (const value of values) {
    // Fuck aligning bytes, just concatenate each value's minimal binary representation
    bits += value.toString(2);
}

const buffer = bitStringToBuffer(bits);

console.log(buffer.toString('hex')); // f7fffdebb83feefdf258f1af2518947234adf108be988780
console.log(buffer.toString('base64')); // 9//967g/7v3yWPGvJRiUcjSt8Qi+mIeA
console.log(base62.encode(buffer)); // yzzzthk3ztVlaiUDUIZ4ev6bRuX5yc8U0
console.log(buffer.length); // 24
```

I've done some experimenting with perceptual hashing as well, which DOES get the hash down to 11 characters when using base62:

```js
const fs = require('node:fs');
const phash = require('sharp-phash');
const base62 = require('@fry/base62');
function bitStringToBuffer(bitString) {
    const remainder = bitString.length % 8;

    if (remainder !== 0) {
        bitString = bitString.padEnd(bitString.length + (8 - remainder), '0');
    }

    const buffer = Buffer.alloc(bitString.length / 8);

    for (let i = 0; i < bitString.length; i += 8) {
        const byteString = bitString.slice(i, i + 8);
        buffer[i / 8] = parseInt(byteString, 2);
    }

    return buffer;
}

async function main() {
    const mii1 = fs.readFileSync('./k9k8yk4qwtqk_standard.png');
    const mii2 = fs.readFileSync('./u2jg043u028x_standard.png');
    const mii3 = fs.readFileSync('./u4w5ibugms72_standard.png');

    // sharp-phash returns a 64 character bit string for each image
    const bits1 = await phash(mii1);
    const bits2 = await phash(mii2);
    const bits3 = await phash(mii3);

    const hash1 = bitStringToBuffer(bits1);
    const hash2 = bitStringToBuffer(bits2);
    const hash3 = bitStringToBuffer(bits3);

    console.log(hash1.toString('hex')); // 20dac8682025809a
    console.log(hash1.toString('base64')); // INrIaCAlgJo=
    console.log(base62.encode(hash1)); // 8Dh8Q20bW9A

    console.log(hash2.toString('hex')); // 14fbcbd82063901a
    console.log(hash2.toString('base64')); // FPvL2CBjkBo=
    console.log(base62.encode(hash2)); // 5FlBs21Za1A

    console.log(hash3.toString('hex')); // 0cd3c1480c2531b2
    console.log(hash3.toString('base64')); // DNPBSAwlMbI=
    console.log(base62.encode(hash3)); // 3DF1I0mbCR2
}
main();
```

However, my concern with this is that I have no idea if

Which DOES tell me that there will be collisions? Perceptual hashing might work, but only if we can guarantee there won't be collisions, and that guarantee comes at the cost of longer hashes. Honestly I'm very stumped as to how Nintendo calculated these hashes; even at 12 characters that's only 62^12 combinations, which still falls WAY SHORT of what @HEYimHeroic calculated for even the Wii.
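(For illustration only: a rough sketch of comparing two of the perceptual hashes above by Hamming distance, which is what phash is built around. Visually similar images are supposed to land on identical or nearby hashes, so collisions are part of the technique rather than a bug. The file names are reused from the earlier snippet, and hammingDistance is just a helper defined here.)

```js
const fs = require('node:fs');
const phash = require('sharp-phash');

// Count the differing bits between two equal-length '0'/'1' bit strings
function hammingDistance(bitsA, bitsB) {
    let distance = 0;

    for (let i = 0; i < bitsA.length; i++) {
        if (bitsA[i] !== bitsB[i]) {
            distance++;
        }
    }

    return distance;
}

async function main() {
    const bits1 = await phash(fs.readFileSync('./k9k8yk4qwtqk_standard.png'));
    const bits2 = await phash(fs.readFileSync('./u2jg043u028x_standard.png'));

    // 0 means the hashes are identical (a "collision"); small values mean the
    // renders merely look similar to the algorithm
    console.log(hammingDistance(bits1, bits2));
}

main();
```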
Correction, in case you didn't realize it: every "Mii hash", as NNAS called them, was unique to each user and permanent. Even if they wanted to do something like what you're mentioning, NNAS doesn't actually verify Mii data beyond the CRC16 (you can test this yourself: if you inject random data with the same length and a valid CRC, it'll set it but not render).
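(For reference, a sketch of the CRC-16 check being referred to. This assumes the CRC-16/XMODEM parameters commonly documented for Mii data, so treat the exact polynomial and initial value as an assumption.)

```js
// Sketch of a CRC-16 with polynomial 0x1021 and initial value 0x0000
// (the CRC-16/XMODEM variant; assumed here to match what NNAS checks)
function crc16(buffer) {
    let crc = 0x0000;

    for (const byte of buffer) {
        crc ^= byte << 8;

        for (let bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000) ? ((crc << 1) ^ 0x1021) : (crc << 1);
            crc &= 0xFFFF;
        }
    }

    return crc;
}

// Usage: compute the CRC over everything except the trailing checksum bytes
// and compare it against the value stored at the end of the Mii data blob:
// crc16(miiData.subarray(0, miiData.length - 2));
```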
I don't think they're hashes at all, just IDs. Why hash at all? IMO, you should treat the 'hash' as a primary key to a database object that actually stores all of the Mii's data, fetch that on request, serve the images, and use whatever frontend caching proxy you like with a Cache-Control header to cache the images. No risk of collision that way, and you retain enough control over cache-busting/TTL to avoid serving stale data.
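(A rough sketch of that suggestion, assuming an Express-style handler; getRenderById is a hypothetical lookup against the database/storage described above, not an existing function.)

```js
const express = require('express');

const app = express();

// Hypothetical lookup: resolve the ID/primary key to the stored PNG,
// or null if no such render exists
async function getRenderById(id) {
    return null; // placeholder
}

app.get('/mii/:id/standard.png', async (request, response) => {
    const image = await getRenderById(request.params.id);

    if (!image) {
        return response.sendStatus(404);
    }

    // Let a fronting cache (e.g. Cloudflare) hold the image long-term;
    // cache-busting/TTL stays under the server's control via this header
    response.set('Cache-Control', 'public, max-age=31536000');
    response.type('png').send(image);
});

app.listen(8080);
```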
This issue was made as a public place to document something we've already discussed internally, so in all fairness the details here are somewhat lacking.

Right now, for every user, we render each Mii individually and store it based on the owning user's unique PID. This works well when we need to do things like quickly query for a specific user's Mii, since PIDs are public, but it also means we end up doing a lot of duplicate work and storing a lot of duplicate renders.

The idea behind the hashing is to only render each unique Mii once, and then never need to render it again. Rather than referencing each render by the owning user's PID, it would be referenced by its hash, and each user would simply be assigned the hash of their Mii's render. Right now we use Mii Studio to render images (though even once we move off of Mii Studio, this still saves us on storage and bandwidth), so only rendering each Mii once reduces load both on Nintendo's servers and on ours by preventing us from making "duplicate" renders.

A huge number of our users share Miis with many other users, mostly the default Mii and ones from those "how to make whatever Mii" YouTube videos that were hugely popular years ago. Only rendering these Miis a single time lets us store one copy of each rather than many. Right now, if we have 500,000 users, then we need to store 500,000 renders, no matter what. By reducing the duplication this way, the only time we approach a number of renders equal to the number of users is in the worst-case scenario, and even then that "worst case" is just back to the way we were doing it before. Doing it this way only gets us positive gains at best, and at worst it has functionally no effect (though we know it won't be the worst case, since we already know there's a lot of duplication).

It also means that if someone uses Mii1, changes to Mii2, and then goes back to Mii1 (which does happen), we've currently rendered Mii1 twice for the same user. By only rendering each Mii once, this becomes 2 renders at worst rather than 3.

Storing and referencing renders by the hash, rather than the PID, also lets us do more aggressive caching in Cloudflare. When a render is referenced by the owning user's PID, we have to take Mii changes into account in our caching, since changing the render the PID points to would invalidate the cache. But if renders are instead referenced by their own unique hash, we can be much more aggressive in our caching and not worry about users changing their Miis, since the render a hash references will never change; a user changing Miis just boils down to changing which render they use. Doing it this way effectively lets us offload all Mii image serving onto Cloudflare: we can have it cache the images for the maximum amount of time and forget about it, so requests never hit our origin/storage servers at all after the first one.

The "hashing" part is just a way to globally and uniquely identify a Mii based on its appearance data. It doesn't necessarily need to be a "hash" at all (in fact, none of the methods I mentioned earlier outside of the perceptual hash were even "hashes" to begin with, they were just different encodings of the appearance data). It's supposed to just be a way for the servers to look at incoming Mii data and quickly decide whether to render the Mii or skip the render because that work has already been done.
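(A minimal sketch of the flow described above. extractAppearanceData, renderExists, storeRender, and renderWithMiiStudio are placeholders for the real implementation, and SHA-256 stands in for whatever identifier/encoding is ultimately chosen.)

```js
const crypto = require('node:crypto');

// Placeholders standing in for the real storage and rendering layers
async function renderExists(hash) { return false; }
async function storeRender(hash, image) { /* write to local storage or S3 */ }
async function renderWithMiiStudio(appearanceData) { return Buffer.alloc(0); }

// Placeholder: should strip everything that doesn't affect the render (names,
// device origin, etc.) and return only the appearance-relevant fields
function extractAppearanceData(miiData) { return miiData; }

async function getOrCreateRender(miiData) {
    const appearanceData = extractAppearanceData(miiData);

    // The "hash" is just a unique identifier derived from appearance data;
    // a plain encoding of the appearance data would work here too
    const hash = crypto.createHash('sha256').update(appearanceData).digest('hex');

    // Each unique appearance only ever gets rendered and stored once
    if (!(await renderExists(hash))) {
        const image = await renderWithMiiStudio(appearanceData);
        await storeRender(hash, image);
    }

    // The user record stores this hash; the image URL is derived from it
    return hash;
}
```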
Okay, I see that they can't necessarily be tied to the users. But if the input appearance data is still identical, and you're not expecting to have more than 11 characters' worth(?) of Miis, could you not still have a database table with a primary key (that you can use as this 'hash') and a unique key on the Mii appearance data?

The only remaining problem I could see with this approach is cleaning up stale renders/entries in the appearance database, which could be a bit of a faff. Just food for thought!
My original comment stated the opposite of this: we do expect to have more. Even if we never reach that many users, we still need to account for that many Miis. 11 alphanumeric characters only gives 62^11 (52,036,560,683,837,095,936) combinations. According to HEYimHeroic, there are 5,174,537,177,903,891,456,720,160,880,400,592,528,520 total possible Mii combinations on the original Wii, and the Wii U added even more combinations on top of this.
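(Quick back-of-the-envelope arithmetic on that, using the Wii figure quoted above, to show how many base62 characters an ID would actually need:)

```js
// How many base62 characters are needed to cover the quoted number of
// possible Wii Miis (~5.17 * 10^39)?
const combinations = 5174537177903891456720160880400592528520n;

let length = 0;
let capacity = 1n;

while (capacity < combinations) {
    capacity *= 62n;
    length++;
}

console.log(length); // 23, so an 11 character base62 ID (62^11 ~= 5.2 * 10^19) falls far short
```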
The issue with this is that the whole input data won't be identical even if the render is. I mentioned this in my original comment as well. The input data contains information that is irrelevant to the final render, such as device origin, birthday, Mii/owner names, etc. Even if the same appearance data is used, the full input may be different. That means we would need to do some processing of the input data beforehand to extract out just the appearance data, and then you're right back at my original comment.
There is no "user storing the key" here; that's just not what's going on. It's about how we store images and reference them internally. The client is always just given the URL to the render (or the raw Mii data) by the server. As for the other 2 things, that's kinda just what's already being proposed, with a few extra steps...? Using the appearance data as the unique identifier was the exact first thing I mentioned in my original comment:
I think you might be getting too caught up on the word "hash" here; really nothing outside of the phash has even been a hash. They've just been unique identifiers that use the appearance data in different encodings. We used the term "hash" because one of the early ideas (again, this discussion started internally and this issue was just made to document it) was to hash the input data using something like MD5 or SHA, and because "Mii hash" was an existing term. We've just continued to use the term "hash" for consistency between our discussion sessions; really all we're talking about is a "unique identifier for a Mii render".

Adding this database layer on top would work, sure, but it's more steps and more overhead than really necessary. It's a bit over-engineered. The appearance data itself would already be a unique identifier, so there's no real need to also slap a database with its own primary key on top of that, and doing lookups in the database vs. a storage lookup is really not much different in terms of speed (we only support 2 storage options, either local file storage or S3, both of which are more than fast enough), except that with a database on top we're now also using storage to save records we don't really need. The database would only end up being a mapping between appearance data and the unique IDs, plus a way to report whether a render exists or not, which is kinda pointless when we can just skip the database entirely when creating the unique IDs, and checking the storage solution for the render is already fast enough.
Checked Existing
What enhancement would you like to see?
When uploading a Mii image to the CDN, hash it, so that each unique Mii combination is only uploaded once.
Any other details to share? (OPTIONAL)
Nintendo previously did this, which allowed them to cache Mii images long-term. We have previously run into stale caching causing confusion for users, where their Mii didn't appear to have updated on Juxt/NNID settings.
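(A minimal sketch of the request as filed, assuming Node's built-in crypto; uploadToCDN is a placeholder, not an existing helper.)

```js
const crypto = require('node:crypto');

// Placeholder for whatever actually pushes the image to the CDN/storage
async function uploadToCDN(key, image) { /* ... */ }

async function uploadMiiImage(image) {
    // Key the upload by the image contents so identical Mii renders are only
    // ever uploaded (and cached) once, and a given URL never goes stale
    const hash = crypto.createHash('sha256').update(image).digest('hex');

    await uploadToCDN(`mii/${hash}.png`, image);

    return hash;
}
```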