Debug your life like a DRL agent

Mohamed Abogazia
5 min read · Aug 16, 2020

This is an intersection I found myself at after reading “The Subtle Art of Not Giving a F***” and “Reinforcement Learning”.

Suffer.

Why do I always suffer? What’s the point? And why always me? What’s wrong with my life?

I’m Arab. These two words have given me more disadvantages than you could possibly imagine. Not being Arab itself, but being born in a neglected third-world country. Everyone judges me for it, though I didn’t choose it, and though I’m already facing enough problems: no education, no income (valued in American dollars), no nothing.

And I started questioning why my life is so f*ed up. It got me really sad. I don’t want to say depressed, because depression is more serious than that, and I don’t want to abuse the word.

After some reading, I came to accept that I’m not the only one who’s suffering. Everyone suffers, and I’ll continue to suffer my whole life whether I like it or not. I can bitch and whine about it, or I can take responsibility and figure out a way of suffering better, of suffering beautifully, as Lex Fridman says.

Imagine you’re training a reinforcement learning agent, and your agent is receiving only negative rewards. Then it crashes, then more negative rewards, epoch after epoch. You’ve trained it for hours and it’s not improving. You realize something is wrong and you start debugging your agent. What could be wrong?

1- Your value function is misleading or superficial:

Imagine your agent is a racing car and your value function is “be faster than the other cars on the track.” At first glance, this seems like a good value function: if you’re faster than everyone else, you’re going to finish first! Easy!

With this value function, your agent will rarely make it past half of the track. No matter how much you train it, no matter how much it suffers, it will be a failure.

This is you when you think you’ll be happy once you’re richer than everyone else, or once you have more knowledge than everyone else (the latter was mine for a long time).

It’s just a bad value function, because it depends on external events: you can’t control how fast the others go. A better one would be “go as fast as possible at every point of the race.” On top of that, speed isn’t the only factor, even though it’s literally a speed race; it doesn’t matter if you’re going 300 km/h and end up shooting off the track. So an even better one is “go as fast as possible at every point of the race while keeping the car balanced and on the track.”
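
If you want to see the difference in code, here’s a minimal sketch. In most RL code, the signal you design like this is written as a reward function, and the state fields below (my speed, the rivals’ speeds, whether I’m off the track) are made up for illustration:

```python
# A minimal sketch of the two reward designs discussed above.
# The state fields (my_speed, rival_speeds, off_track) are hypothetical placeholders.

def relative_reward(state):
    """'Be faster than everyone else': depends on things the agent can't control."""
    return 1.0 if state["my_speed"] > max(state["rival_speeds"]) else -1.0

def speed_and_stability_reward(state):
    """'Go as fast as possible while staying on the track': depends only on the agent."""
    if state["off_track"]:
        return -10.0                      # shooting off the track wipes out any speed gain
    return state["my_speed"] / 300.0      # reward speed at every step, scaled to roughly [0, 1]
```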

2- Your learning algorithm is not correct:

When you study any DRL algorithm, you also study a proof that, after some training (or suffering, in human language), it will achieve its goal of maximizing the total reward. There are no such proofs in real life that you’re learning, but unlike the agent, you’re smart enough to figure out whether you’re learning or not. If you’re not learning from your suffering, you’re not benefiting from it; you take the bad half and leave out the good one. You’re going to suffer anyway, so learn from it.
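
For the curious, this is roughly what a “correct learning algorithm” looks like in code: a minimal sketch of a tabular Q-learning update, one of the classic algorithms that actually comes with a convergence proof (given enough exploration and a properly decaying learning rate). The table sizes and names here are made up for illustration:

```python
import numpy as np

# Toy setup: a small table of state-action values, sizes invented for illustration.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99   # learning rate, discount factor

def q_update(state, action, reward, next_state):
    """One learning step: move Q(s, a) toward reward + discounted best future value."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```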

But how does that map to our lives as human beings?

Correct RL algorithms are built around the idea of making the choice that maximizes the total reward, and so should you. Wake up one day, look in the mirror, and ask yourself: will what I’m about to do today maximize my reward function? Will I receive a positive reward for it?
And by reward, I don’t mean money or fame; I mean a human reward. Luckily, human beings come with a preinstalled reward system. If you listen to it, if the algorithm you wake up to every day is built to maximize this total reward, that’s when you suffer beautifully.

And when I say total reward, I mean total reward. Part of the challenge of DRL is the agent’s inability to predict the future, but you can.

Drinking or drugs will give your reward system a huge boost, but will they maximize your total reward? How much negative reward are you going to receive if you end up addicted? This is what a “greedy” algorithm does: grab the big short-term reward and rely on luck after that. Maybe you’ll become an alcoholic, maybe not; most of the time you do, and a purely greedy algorithm is not a correct way to maximize the total reward.
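
Here’s a tiny, made-up numerical sketch of that argument. A trajectory with one big immediate reward followed by a long tail of small penalties loses, in discounted total reward, to a steady stream of small positive rewards. The numbers are invented; only the shape of the argument matters:

```python
gamma = 0.95  # discount factor: how much you care about the future

def total_reward(rewards, gamma=gamma):
    """Discounted sum of a whole trajectory of rewards."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

greedy_night   = [+10] + [-2] * 50   # big reward now, long tail of negative reward
patient_choice = [+1] * 51           # small but steady positive reward

print(total_reward(greedy_night))    # comes out clearly negative
print(total_reward(patient_choice))  # comes out clearly positive
```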

3- Maybe the environment is too complicated?

If you’re sure that the value function is acceptable and your algorithm is correct, and the agent is still not reaching the goal, then what’s wrong?
Not all agents take the same training time, and not all humans suffer equally. Some agents train on a track that’s just a straight line; others train on a track full of sharp turns, climbs, and descents. Agents on the straight line train less, but they end up less intelligent: test them on a slightly more complicated track and they f* up really badly. Even so, it’s better for an agent that wants to become more intelligent to start on a simple track and then move gradually to more complicated ones.
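
If you want to see the “start simple, then harden the track” idea in code, here’s a toy, self-contained sketch of such a curriculum; every name and number in it is invented for illustration:

```python
import random

# Toy curriculum: the agent stays on an easy track until its rewards turn positive,
# then carries what it learned to the next, harder track.
TRACKS = {"straight_line": 1, "gentle_curves": 3, "sharp_turns_and_hills": 6}

def run_episode(skill, difficulty):
    """Toy episode: positive reward once skill outgrows the track's difficulty."""
    return skill - difficulty + random.uniform(-1, 1)

skill = 0.0
for name, difficulty in TRACKS.items():          # easy -> hard
    while run_episode(skill, difficulty) < 0:    # stay until rewards turn positive
        skill += 0.05                            # each episode of suffering teaches a bit
print(f"final skill after the curriculum: {skill:.2f}")
```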

So if you’re choosing good values to follow, and you’re actually following them, and your reward system is still giving you negative rewards, give yourself some time. The more complicated your track, the more suffering it takes. Maybe take yourself to a less complicated track if you have the choice; you’ll receive less valuable rewards than on the complicated track, but at least you’ll start getting some positive rewards that teach you what to do. Fewer choices mean a higher probability of getting the right one, and when you move back to the complicated track, you’ll have knowledge to choose from. If you don’t have that choice, you should believe that you were given this track for a reason: “You have to trust in something — your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all the difference in my life.” - Steve Jobs. Yes, you’re in a harder environment than other agents, but this is a chance: a chance to suffer more, learn more, and become a better, smarter agent than those in simpler environments.

It’s funny how similar this debugging process is to human life. There’s a saying that DRL implementations fail silently: the agent won’t tell you “what is this stupid value function you gave me” or “there’s a bug in your implementation.” It will just train with whatever you gave it, and you won’t even know whether something is wrong or it’s just taking time to learn the environment. Much like our lives: you think your values are great until you’re dying in your bed thinking, “Oh God, what a waste my life was. I wish I had done better.”

But what is the metric here? We create an agent for a purpose, and we say one agent is better than another if it can achieve that purpose in a smarter, faster, or cheaper way. What about humans?

That’s where human knowledge branches the most: at figuring out “what is the meaning of life?”
That will be the subject of my next story :)

Peace out ✌
