Compute MD5 or SHA hash of large file efficiently on iOS and Mac OS X
September 7, 2010
Computing cryptographic hashes of files on iOS and Mac OS X using the CommonCrypto APIs is fairly easy, but doing it in a way that minimizes memory consumption even with large files can be a little more difficult… The other day, I was reading what some people were saying about this on a forum about iPhone development. They thought they had found the trick, but they still had a growing memory footprint with large files because they forgot something fundamental about memory management in Cocoa.
Updated
- Friday, October 1, 2010: removed comment about the fact that I used character arrays on the heap with the more modular solution described at the end of the post; this is now fixed, and that more general solution is now as efficient as the simple one described here.
- Sunday, October 17, 2010: added link to a simple GitHub repository that I created to show exactly how to integrate my function FileMD5HashCreateWithPath with a simple iOS or Mac application.
What was wrong with that solution?
Even though they read bytes from the file progressively instead of reading everything at once, that did not reduce the memory consumption of their program when computing hashes of large files. Their mistake was that the bytes read in the while loop came back in an autoreleased instance of NSData. So, unless they create a local autorelease pool within the while loop, those objects just accumulate until the enclosing autorelease pool is drained. But I think it would be very inefficient to add an autorelease pool in the while loop, because you would end up allocating a new object in every pass of the loop.
So, in my opinion, the right question is: how do we read those bytes without getting an autoreleased object?
How to get around that problem?
I looked for a solution, and I couldn’t find anything that would do the same thing as -[NSFileHandle readDataOfLength:] at the Foundation level without returning an autoreleased object. So I thought: we have to go deeper. I looked for something similar in Core Foundation, and sure enough, I found the CFReadStream API.
And since I was going to do this using Core Foundation to read those bytes, I decided to go all the way with Core Foundation, with a solution in pure C.
Here’s how you can efficiently compute the MD5 hash of a large file with CommonCrypto and Core Foundation:
FileMD5Hash.c
    // Standard library
    #include <stdint.h>
    #include <stdio.h>

    // Core Foundation
    #include <CoreFoundation/CoreFoundation.h>

    // Cryptography
    #include <CommonCrypto/CommonDigest.h>

    // In bytes
    #define FileHashDefaultChunkSizeForReadingData 4096

    // Function
    CFStringRef FileMD5HashCreateWithPath(CFStringRef filePath,
                                          size_t chunkSizeForReadingData) {

        // Declare needed variables
        CFStringRef result = NULL;
        CFReadStreamRef readStream = NULL;

        // Get the file URL
        CFURLRef fileURL =
            CFURLCreateWithFileSystemPath(kCFAllocatorDefault,
                                          (CFStringRef)filePath,
                                          kCFURLPOSIXPathStyle,
                                          (Boolean)false);
        if (!fileURL) goto done;

        // Create and open the read stream
        readStream = CFReadStreamCreateWithFile(kCFAllocatorDefault,
                                                (CFURLRef)fileURL);
        if (!readStream) goto done;
        bool didSucceed = (bool)CFReadStreamOpen(readStream);
        if (!didSucceed) goto done;

        // Initialize the hash object
        CC_MD5_CTX hashObject;
        CC_MD5_Init(&hashObject);

        // Make sure chunkSizeForReadingData is valid
        if (!chunkSizeForReadingData) {
            chunkSizeForReadingData = FileHashDefaultChunkSizeForReadingData;
        }

        // Feed the data to the hash object
        bool hasMoreData = true;
        while (hasMoreData) {
            uint8_t buffer[chunkSizeForReadingData];
            CFIndex readBytesCount = CFReadStreamRead(readStream,
                                                      (UInt8 *)buffer,
                                                      (CFIndex)sizeof(buffer));
            if (readBytesCount == -1) break;
            if (readBytesCount == 0) {
                hasMoreData = false;
                continue;
            }
            CC_MD5_Update(&hashObject,
                          (const void *)buffer,
                          (CC_LONG)readBytesCount);
        }

        // Check if the read operation succeeded
        didSucceed = !hasMoreData;

        // Compute the hash digest
        unsigned char digest[CC_MD5_DIGEST_LENGTH];
        CC_MD5_Final(digest, &hashObject);

        // Abort if the read operation failed
        if (!didSucceed) goto done;

        // Compute the string result
        char hash[2 * sizeof(digest) + 1];
        for (size_t i = 0; i < sizeof(digest); ++i) {
            snprintf(hash + (2 * i), 3, "%02x", (int)(digest[i]));
        }
        result = CFStringCreateWithCString(kCFAllocatorDefault,
                                           (const char *)hash,
                                           kCFStringEncodingUTF8);

    done:

        if (readStream) {
            CFReadStreamClose(readStream);
            CFRelease(readStream);
        }
        if (fileURL) {
            CFRelease(fileURL);
        }
        return result;
    }
Then, from your Objective-C code, you can just use that function like this:
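    NSString *filePath = ...; // Let's assume filePath is defined
    CFStringRef md5hash =
        FileMD5HashCreateWithPath((CFStringRef)filePath,
                                  FileHashDefaultChunkSizeForReadingData);
    NSLog(@"MD5 hash of file at path \"%@\": %@",
          filePath, (NSString *)md5hash);
    if (md5hash) CFRelease(md5hash); // NULL if the file could not be read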
Remember that FileMD5HashCreateWithPath transfers ownership of the returned string, so you must release it yourself.
I also created a small GitHub repository that may help you understand how to integrate that code in your project. It contains a very simple Xcode project, with a target for iOS and another one for Mac OS X. In both cases, the application just provides a simple button to compute the MD5 hash of the executable file (the binary). Here is where you can find that repository: FileMD5Hash GitHub repository.
Advantages of this solution
There are several nice things about this implementation:
- first, it works as advertised: it computes the MD5 hash of the file correctly, and it doesn’t make the memory footprint of your app grow, even if you give it the path to a huge file;
- even though the path argument is a CFStringRef, it’s really easy to use this from Objective-C, thanks to the fact that NSString and CFStringRef are toll-free bridged; cf. example above for usage;
- it works just fine both on iOS and on Mac OS X;
- by reusing sizeof(digest), I avoided hard-coding the real value of CC_MD5_DIGEST_LENGTH in the string conversion, which would make the code more difficult to adapt to other cryptographic algorithms.
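To make the last point concrete, here is a minimal sketch of the hex conversion on its own, in portable C. It is not taken from FileMD5Hash.c: EXAMPLE_DIGEST_LENGTH stands in for CC_MD5_DIGEST_LENGTH so that it builds without CommonCrypto, and ExampleDigestToHex and ExampleHexOfEmptyInputMD5 are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: stand-in for CC_MD5_DIGEST_LENGTH (16 bytes for MD5). */
#define EXAMPLE_DIGEST_LENGTH 16

/* Hypothetical helper: writes the lowercase hex form of `digest` into `hex`.
   Returns the number of characters written, not counting the final NUL. */
size_t ExampleDigestToHex(const unsigned char *digest, size_t digestLength,
                          char *hex, size_t hexSize) {
    assert(digest != NULL && hex != NULL);
    assert(hexSize >= 2 * digestLength + 1);
    for (size_t i = 0; i < digestLength; ++i) {
        /* Each byte becomes exactly two hex characters plus a NUL. */
        snprintf(hex + (2 * i), 3, "%02x", (unsigned int)digest[i]);
    }
    hex[2 * digestLength] = '\0';
    return 2 * digestLength;
}

/* Hypothetical caller: the only place a length appears is the declaration
   of `digest`; everything else is derived from sizeof(digest). */
size_t ExampleHexOfEmptyInputMD5(char *hex, size_t hexSize) {
    /* Well-known MD5 digest of the empty input, hard-coded for the demo. */
    unsigned char digest[EXAMPLE_DIGEST_LENGTH] = {
        0xd4, 0x1d, 0x8c, 0xd9, 0x8f, 0x00, 0xb2, 0x04,
        0xe9, 0x80, 0x09, 0x98, 0xec, 0xf8, 0x42, 0x7e
    };
    return ExampleDigestToHex(digest, sizeof(digest), hex, hexSize);
}
```

Switching to a longer digest would then only mean changing the array declaration; the loop bound and the size of the output string follow automatically.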
How about SHA1, SHA256, and others?
It’s really simple to adapt this function to other algorithms. Say you want to adapt it to get the SHA1 hash instead. Here’s what you need to do:
- replace CC_MD5_CTX with CC_SHA1_CTX;
- replace CC_MD5_Init with CC_SHA1_Init;
- replace CC_MD5_Update with CC_SHA1_Update;
- replace CC_MD5_Final with CC_SHA1_Final;
- replace CC_MD5_DIGEST_LENGTH with CC_SHA1_DIGEST_LENGTH;
Or more simply, just do a find and replace to change every occurrence of the string “MD5” to “SHA1”. Voilà, you got it!
Another way to extend this to other algorithms is to make this function more modular, and basically take all of those things as arguments. This is a little more difficult, but I did it for my project TagAdA. With this more advanced and more modular solution, you have a third argument that represents the algorithm that you wish to use, and you only have one instance of the code associated with that logic in your binary, even if you use several of those cryptographic algorithms in your app. I even went to great lengths using the preprocessor to minimize the amount of duplicated code in my source file.
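One way to picture that modular design is a small table of function pointers, one entry per algorithm, with a single driver shared by all of them. The sketch below is only an illustration under stated assumptions, not the TagAdA code: two toy checksums (a one-byte sum and Fletcher-16) stand in for the CC_MD5_* and CC_SHA1_* families so that it builds without CommonCrypto, the input is a memory buffer rather than a read stream, and every name starting with Example is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one context type large enough for every algorithm. */
typedef union {
    uint32_t sum;                        /* used by the one-byte sum */
    struct { uint16_t a, b; } fletcher;  /* used by Fletcher-16 */
} ExampleHashContext;

/* Hypothetical descriptor: everything the driver needs to know. */
typedef struct {
    size_t digestLength;
    void (*init)(ExampleHashContext *ctx);
    void (*update)(ExampleHashContext *ctx, const uint8_t *data, size_t length);
    void (*final)(uint8_t *digest, ExampleHashContext *ctx);
} ExampleHashAlgorithm;

/* Toy algorithm 1: sum of all bytes modulo 256 (not cryptographic). */
static void Sum8Init(ExampleHashContext *ctx) { ctx->sum = 0; }
static void Sum8Update(ExampleHashContext *ctx, const uint8_t *data, size_t length) {
    for (size_t i = 0; i < length; ++i) ctx->sum = (ctx->sum + data[i]) & 0xFFu;
}
static void Sum8Final(uint8_t *digest, ExampleHashContext *ctx) {
    digest[0] = (uint8_t)ctx->sum;
}

/* Toy algorithm 2: Fletcher-16 (not cryptographic either). */
static void Fletcher16Init(ExampleHashContext *ctx) {
    ctx->fletcher.a = 0;
    ctx->fletcher.b = 0;
}
static void Fletcher16Update(ExampleHashContext *ctx, const uint8_t *data, size_t length) {
    for (size_t i = 0; i < length; ++i) {
        ctx->fletcher.a = (uint16_t)((ctx->fletcher.a + data[i]) % 255);
        ctx->fletcher.b = (uint16_t)((ctx->fletcher.b + ctx->fletcher.a) % 255);
    }
}
static void Fletcher16Final(uint8_t *digest, ExampleHashContext *ctx) {
    digest[0] = (uint8_t)ctx->fletcher.b;  /* high byte */
    digest[1] = (uint8_t)ctx->fletcher.a;  /* low byte */
}

const ExampleHashAlgorithm kExampleSum8 =
    { 1, Sum8Init, Sum8Update, Sum8Final };
const ExampleHashAlgorithm kExampleFletcher16 =
    { 2, Fletcher16Init, Fletcher16Update, Fletcher16Final };

/* Single driver: the chunking logic exists once, whatever the algorithm.
   Writes the digest and returns its length in bytes. */
size_t ExampleHashBytes(const ExampleHashAlgorithm *algorithm,
                        const uint8_t *data, size_t length,
                        size_t chunkSize, uint8_t *digest) {
    assert(algorithm != NULL && digest != NULL);
    if (chunkSize == 0) chunkSize = 4096;  /* same fallback idea as the post */
    ExampleHashContext ctx;
    algorithm->init(&ctx);
    for (size_t offset = 0; offset < length; offset += chunkSize) {
        size_t remaining = length - offset;
        algorithm->update(&ctx, data + offset,
                          remaining < chunkSize ? remaining : chunkSize);
    }
    algorithm->final(digest, &ctx);
    return algorithm->digestLength;
}
```

With CommonCrypto, each table entry would presumably point at thin wrappers around the matching CC_*_Init, CC_*_Update and CC_*_Final functions, and the driver would read from the CFReadStream instead of a memory buffer.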
Anyway, there you go! I hope you will find this useful.
Thanks a lot for the great library and your help getting it to work ;). It’s doing exactly what I needed and it’s lightning fast too.
For those like me struggling to make it work in their iOS projects, I created (with Joel’s blessing) a GitHub repo with the necessary files. You can find it at http://github.com/Fuitad/FileMD5Hash
My pleasure! I’m glad you found that useful Pierre!
How to use it with iOS? I’m getting error:
Undefined symbols:
“FileMD5HashCreateWithPath(__CFString const*, unsigned long)”, referenced from:
-[MyAppDelegate createEditableCopyOfDatabaseIfNeeded] in MyAppDelegate.o
ld: symbol(s) not found
collect2: ld returned 1 exit status
@Andrey
Please make sure to add FileMD5Hash.c to the list of files that Xcode is supposed to compile for your target. One way to do that is to drag and drop FileMD5Hash.c to the “Compile Sources” build phase of your target.
This didn’t work for me because I tried to use it in a .mm file. The solution is simple:
Just add this code to FileMD5Hash.h:
    #ifdef __cplusplus
    #define MYAPP_EXTERN extern "C"
    #else
    #define MYAPP_EXTERN extern
    #endif
and declare the function in FileMD5Hash.h like this:
    MYAPP_EXTERN CFStringRef FileMD5HashCreateWithPath(CFStringRef filePath,
                                                       size_t chunkSizeForReadingData);
Thank you Joel for your MD5 code and for the solution on how to use it in .mm files!
No problem Andrey, thanks for reposting this trick in your comment!
I decided to fork Pierre’s GitHub repository, and to add a simple Xcode project that shows how to integrate this code with a simple iOS or Mac application. This should document in more detail things that I intentionally omitted in the blog post (to keep it simple, and more readable).
So, if you can’t figure out how to make this code work in your project, please, take a look at the FileMD5Hash GitHub repository.
And many thanks to Pierre for coming up with this great idea of a simple GitHub repository for this code!
Joel:
I am using your FileMD5Hash.c file (compiled only) as is in a product of ours. It is unclear to me what I need to include in order to be in compliance with your license.
We have a ReadMe file for our product, is putting a notice in there sufficient?
I do not speak legalese very well. How should this notice read?
Regards
Neil
@Neil You don’t even need to mention me in your README. The only thing I care about is that you keep my copyright notice in the source files, and that if you change the files in any way, you mention that in a comment in the source file. So just enjoy!
Joel. Why can’t you just use the original code and wrap the readDataOfLength and related code into an autorelease pool allocation and release pair? Wouldn’t that be much clearer with the same effect?
Joan
@Joan Your idea would work too, that’s true. However, when you say “much clearer”, I just want to say that it has to do with how familiar you are with Foundation and CoreFoundation. Some people might prefer to use CoreFoundation.
I don’t mind using CoreFoundation for some things, and this implementation is actually a little more efficient than what you’re suggesting. Cf. the Cocoa Fundamentals Guide:
So I guess what I should tell you is this: if you feel more comfortable using Foundation level APIs and you don’t mind or can’t notice the slight performance hit, then you should definitely do it your way.
Joel. Actually I feel very comfortable with CoreFoundation. My background is raw ‘C’ and I even programmed in assembler, so you can imagine what kind of things I am used to. I even have a strong preference (we could call it an obsession) for using Core Foundation collections instead of their Cocoa equivalents; specifically, I use NULL retain/release callbacks all the time on CFArrays and CFDictionaries.
Even when using cocoa I avoid explicit autorelease calls in my code. If I have to return a new object I always implement ‘create’ methods. I only leave implicit autoreleases when the object will be immediately retained anyway so the memory overhead is zero.
So my post was not really about what I would do but about what most developers might consider doing.
That said, I still believe that using Cocoa tends to be easier and more convenient for most developers. What the docs recommend about autorelease pools is precisely to avoid doing what the original code did, that is, actually *using* autorelease pools. By creating and draining an inner autorelease pool as per my suggestion, what we achieve is to release the objects right there, in fact avoiding the use of the global autorelease pool, which is what really has to be prevented.
At the end of the day we both are thinking alike and possibly using the same coding patterns, so that’s the important thing.
Joan
Great work 😉
Hi Joel.
I use your trick with success. Great job.
But now I’ve a question for you: can I use your trick with a file on a remote site, that is, with a ‘filePath’ that looks like ‘http://…’?
Thanks,
Alex.
thanks for the code, it’s very helpful!
Best Regards,
Daniel Oliva
Hey Everyone!,
Here is the easiest way that I had everything up and running,
1.) Download the FileMD5Hash.c & FileMD5Hash.h from the linked github.
2.) Xcode -> New Project -> Foundation Tool (Command Utility) ->Drag both .c & .h Files into Source folder in Xcode
3.) Follow Andrey's advice in regards to the modification of FileMD5Hash.h;
4.) Set the correct filePath (example: NSString *filePath = @"/Users/YourUserName/YourUserFile.pdf";)
5.) Run !
*If an EXC_BAD_ACCESS error occurs, it's probably because you're trying to CFRelease(md5hash) when md5hash is nil; md5hash would be nil because CFReadStreamOpen probably failed…
Okay, lastly thanks! and sorry for blowing up your forum with an error message !
Regards,
Daniel
On line 49, is there any reason you are declaring the array inside the loop? I would hope the compiler is smart enough to allocate that array once and keep it around. It just scares me a bit that the C compiler might be dumb enough to reallocate it at every iteration of the loop, and even though that is a small chunk, running that loop thousands of times is going to be troublesome.
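On the question just above: an automatic array like this one normally lives on the stack, so mainstream compilers do not call an allocator on each pass of the loop. If you would rather not depend on that, the declaration can simply be hoisted out of the loop. The sketch below shows that variant in portable C under loud assumptions: stdio's fread stands in for CFReadStreamRead, a toy one-byte sum stands in for CC_MD5_Update, and ExampleSum8OfStream and kExampleChunkSize are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: fixed chunk size, matching the default used in the post. */
enum { kExampleChunkSize = 4096 };

/* Hypothetical helper: reads `stream` to the end through one reusable
   buffer and stores the sum of all bytes modulo 256 in *sumOut.
   Returns 0 on success, -1 if a read error occurred. */
int ExampleSum8OfStream(FILE *stream, uint8_t *sumOut) {
    assert(stream != NULL && sumOut != NULL);

    /* Declared once, before the loop, and reused by every pass. */
    uint8_t buffer[kExampleChunkSize];
    unsigned int sum = 0;

    for (;;) {
        size_t readBytesCount = fread(buffer, 1, sizeof(buffer), stream);
        for (size_t i = 0; i < readBytesCount; ++i) {
            sum = (sum + buffer[i]) & 0xFFu;
        }
        /* A short read means end of file or an error; stop either way. */
        if (readBytesCount < sizeof(buffer)) break;
    }

    if (ferror(stream)) return -1;
    *sumOut = (uint8_t)sum;
    return 0;
}
```

Either way the memory footprint stays at one chunk, which is the property the post is after.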
thank you very, very much! lifesaver
THANK YOU!!! lifesaver
Hey Joel, this is fantastic! One issue — I can’t seem to compile your sample app for Mac 64-bit. Any plans to get that working? Thanks so much!
I modified your TGDFileHash program to include crc32 checksum. You can find it here:
https://gist.github.com/paul-delange/6808278
Feel free to take it or use as you want
Great !
You are absolutely genius !!
I do have compile problems/errors when compiling with ARC… are you experiencing the same?
e.g. implicit declaration as well as needed casts…
Hi Stefan,
Thanks for reporting this problem. Indeed, FileMD5Hash wasn’t ready to be used with ARC. Can you try again with the new API I just pushed to the GitHub project? It should just work now.
Thanks!
Thanks so much for this. Esp. Paul with your CRC32 checksum addition. This saved me at least a day 🙂
Thanks, and it’s easy to understand…
First off – many thanks. That’s a good one.
Second… sorry to be greedy, but why not write it with pure modern Obj-C, so that ARC takes responsibility for Memory, and also better integrate with ObjC code?