Doing a project with this scale of data (100 million ratings, by 480,189 users, on 17,770 movies) shows a person just how much they don’t know about programming yet.
Here are a few things I’ve learned so far:
Objects take up a crap ton of memory.
You don’t realize this when you’re just making your little Person objects in class, but when you try to fit 100 million ratings in memory, objects are not your friend. It’s all about arrays of primitive types.
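To make the difference concrete, here is a rough sketch of the two layouts. The class name, field names, and field types are hypothetical and the sample values are made up; the real project's layout may differ. On a typical 64-bit HotSpot JVM every object carries a header of roughly 12 to 16 bytes plus a reference to reach it, while parallel primitive arrays pay that overhead only once per array.

```java
// Hypothetical sketch: names and field types are illustrative, not taken from the project.
class RatingStorage {

    // Object-per-rating layout: each instance has its own object header,
    // and something else must hold a reference to it.
    static final class Rating {
        final int userId;
        final int movieId;
        final int stars;

        Rating(int userId, int movieId, int stars) {
            this.userId = userId;
            this.movieId = movieId;
            this.stars = stars;
        }
    }

    // Parallel-array layout: rating i is (userIds[i], movieIds[i], stars[i]).
    // Only three array objects exist, no matter how many ratings are stored.
    final int[] userIds;
    final int[] movieIds;
    final int[] stars;
    private int size = 0;

    RatingStorage(int capacity) {
        userIds = new int[capacity];
        movieIds = new int[capacity];
        stars = new int[capacity];
    }

    // Appends one rating; throws ArrayIndexOutOfBoundsException when full.
    void add(int userId, int movieId, int rating) {
        userIds[size] = userId;
        movieIds[size] = movieId;
        stars[size] = rating;
        size++;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        RatingStorage store = new RatingStorage(2);
        store.add(42, 7, 5); // made-up values
        store.add(43, 7, 3);
        System.out.println(store.size() + " ratings stored in 3 arrays");
    }
}
```

The trade-off is readability: you lose the tidy Rating type and have to keep the three arrays in step by index.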
Integers take up way too much space.
Even the relatively small, programmer-abused int type will make you cry when you try to do this. Had to use short for movie ids and
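A quick back-of-the-envelope check of why the element type matters at this scale. This sketch counts only the array payload and ignores the small fixed array header; the helper name payloadBytes is made up for illustration.

```java
class PrimitiveSizes {

    // Bytes taken by the payload of an array with `count` elements
    // of `bytesPerElement` bytes each (array header not included).
    static long payloadBytes(long count, int bytesPerElement) {
        return count * bytesPerElement;
    }

    public static void main(String[] args) {
        long ratings = 100_000_000L;

        // One int per rating: 4 bytes each.
        System.out.println(payloadBytes(ratings, Integer.BYTES)); // prints 400000000

        // One short per rating: 2 bytes each.
        System.out.println(payloadBytes(ratings, Short.BYTES));   // prints 200000000

        // 17,770 movie ids fit in a short because Short.MAX_VALUE is 32767.
        System.out.println(17770 <= Short.MAX_VALUE);             // prints true
    }
}
```

So one 100-million-element column drops from about 400 MB to about 200 MB just by switching from int to short.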
Doubles are stupid.
I never knew this, but double math in Java is really imprecise with decimals: a double is a binary floating-point number, so a value like 0.1 has no exact representation. I don’t have much choice but to use them (for both speed and memory reasons), but adding simple decimals comes out really goofy. Google “java double arithmetic” and see the madness.
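The classic demonstration looks like this. The nearlyEqual helper and its tolerance are assumptions for illustration, not something from the project; comparing with a small tolerance is one common way to live with doubles.

```java
class DoubleDemo {

    // Compares two doubles with a tolerance instead of ==.
    // The right epsilon depends on the magnitudes involved.
    static boolean nearlyEqual(double a, double b, double epsilon) {
        return Math.abs(a - b) <= epsilon;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;

        System.out.println(sum);                         // prints 0.30000000000000004
        System.out.println(sum == 0.3);                  // prints false
        System.out.println(nearlyEqual(sum, 0.3, 1e-9)); // prints true
    }
}
```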
Java does not pass by reference.
It passes references by value. (I had heard this before but never ran into a problem with it until now.) You may say, “What the heck does that mean?” Well, for example, take a Dog called fido and pass it to a method that takes a Dog parameter. Let’s say it’s called goofy in the method. Right now fido and goofy point to the same thing. If you change something in the goofy object, it will change the same thing in the fido object, because they’re the same object. But if you say goofy = new Dog(); or goofy = null; they no longer point to the same thing: fido still points where it did, but now goofy points to something else.
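Here is the fido and goofy story as a small runnable sketch. The name field and the two method names are made up for the example.

```java
class PassByValueDemo {

    static class Dog {
        String name; // hypothetical field, just so there is something to change

        Dog(String name) {
            this.name = name;
        }
    }

    // Changes the object that both fido and goofy point to: the caller sees it.
    static void rename(Dog goofy) {
        goofy.name = "Goofy";
    }

    // Reassigns only the method's own copy of the reference: the caller sees nothing.
    static void replace(Dog goofy) {
        goofy = new Dog("Someone else");
        goofy = null;
    }

    public static void main(String[] args) {
        Dog fido = new Dog("Fido");

        rename(fido);
        System.out.println(fido.name);    // prints Goofy

        replace(fido);
        System.out.println(fido != null); // prints true
        System.out.println(fido.name);    // prints Goofy
    }
}
```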
That doesn’t seem like too much of a problem, but… I had an array of 48,000 PrintWriter objects as I was trying to reorganize the data. At the end of each run I used a for-each loop (for (PrintWriter print : outputArray)) to set them all to null, so they could be reused in the next run. But since each element in that array was copied into a newly declared print variable, print = null didn’t set outputArray[i] = null. So I wasted an hour or two running the sorting program.
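A scaled-down sketch of the bug and one way around it. It uses three in-memory writers instead of 48,000 file writers, and the method names are made up for illustration.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class ForEachNullDemo {

    // Broken: `print` is a fresh local variable holding a copy of each
    // element's reference, so nulling it leaves the array slot untouched.
    static void clearWithForEach(PrintWriter[] outputArray) {
        for (PrintWriter print : outputArray) {
            print = null;
        }
    }

    // Works: assigning through the index changes the array slot itself.
    static void clearWithIndex(PrintWriter[] outputArray) {
        for (int i = 0; i < outputArray.length; i++) {
            outputArray[i] = null;
        }
    }

    // Builds a small array of in-memory writers for the demo.
    static PrintWriter[] makeWriters(int n) {
        PrintWriter[] writers = new PrintWriter[n];
        for (int i = 0; i < n; i++) {
            writers[i] = new PrintWriter(new StringWriter());
        }
        return writers;
    }

    public static void main(String[] args) {
        PrintWriter[] outputArray = makeWriters(3);

        clearWithForEach(outputArray);
        System.out.println(outputArray[0] == null); // prints false

        clearWithIndex(outputArray);
        System.out.println(outputArray[0] == null); // prints true
    }
}
```

java.util.Arrays.fill(outputArray, null) does the same job as the indexed loop in one line.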
That’s all I have for now, but I’m sure I’ll be posting about this again.