You got an email about a batch failure. (set: $time_elapsed to 0)(set: $stress to 0)(set: $baddata_exists to true)(set: $wiki_checked to false)(set: $gameover to false)(set: $stress_increment to 10)(set:$time_increment to 1) Note: Watch your stress level! It may affect your productivity... [[Check wiki page|wiki]] [[Ask a senior engineer|seniorwiki]]You opened the wiki page for the batch.(set:$time_delta to $time_increment*0.25*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) <div class="wiki" ><h1>Rerunning a failed batch</h1> This batch can sometimes be flaky. You can try rerunning the job on Tron first. Use the command <span class="code" >tronctl retry</span></div> [[Rerun job on Tron|firstrerun]] [[Ask a senior engineer|seniorwiki]]You asked a senior engineer on your team what to do.(if:($stress-$stress_increment)<=0)[(set:$stress_delta to $stress*(-1))](else:)[(set:$stress_delta to $stress_increment*(-1))](set:$stress to it+$stress_delta)(set:$time_delta to $time_increment*.25*(1+$stress/100))(set: $time_elapsed to it+$time_delta) <!-- asking consumes less time --> <section class="slack" ><div class="channel">#team-something</div ><div><span class="name" style="color: #4d4da2;">you</span>I'm having trouble with a batch failure!</div ><div><span class="name" style="color: #4d4da2;">you</span>What should I do?</div ><div><span class="name" style="color: #a24d4d;">senioreng</span>Have you checked the runbook on the wiki page?</div ><div><span class="name" style="color: #a24d4d;">senioreng</span>There should be some instructions there.</div ></section> [[Check wiki page|wiki]]You notified stakeholders. Your stress level decreased. (set:$stress to (max:0,$stress-1))(set: $time_elapsed to it+1) ----- Time Elapsed: (print: $time_elapsed) hours Oncall Stress: (print: $stress)% ----- <!-- this is still in development What do you do next? 
(if:$job_rerun is false)[ [[Rerun job on Tron|firstrerun]] ](if:$wiki_checked is false)[ [[Check wiki page|wiki]] ] [[Escalate to a senior engineer|seniorwiki]] -->You reran the job on Tron. (set:$time_delta to $time_increment*0.5*(1+$stress/100))(set: $time_elapsed to it+$time_delta)(set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) The job failed after (print: (round: $time_delta*60)) minutes, so it doesn't seem like a flake... [[Check EMR logs|checkemr]] [[Ask a senior engineer|senioremr]]The EMR log shows a KeyError. Looks like the input is corrupted and missing a required field.(set:$time_delta to $time_increment*0.5*(1+$stress/100))(set: $time_elapsed to it+$time_delta)(set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) [[Change the batch (data consumer) code|changeconsumer]] [[Change the logging (data producer) code|changeproducer]] [[Clean bad log line|cleanbadlog]] [[Ask a senior engineer|seniordontchangeproducer]]You asked a senior engineer on your team what to do. They recommended you check the EMR log on S3. 
(if:($stress-$stress_increment)<=0)[(set:$stress_delta to $stress*(-1))](else:)[(set:$stress_delta to $stress_increment*(-1))](set:$stress to it+$stress_delta)(set:$time_delta to $time_increment*.25*(1+$stress/100))(set: $time_elapsed to it+$time_delta) <!-- asking consumes less time --> <section class="slack" ><div class="channel">#team-something</div ><div><span class="name" style="color: #4d4da2;">you</span>I tried rerunning the job on Tron.</div ><div><span class="name" style="color: #4d4da2;">you</span>But it's still failing.</div ><div><span class="name" style="color: #a24d4d;">senioreng</span>So it's not a flake?</div ><div><span class="name" style="color: #a24d4d;">senioreng</span>You could look into the EMR log for a traceback.</div ></section> [[Check EMR log|checkemr]]You worked with ops to clean the bad log line from S3.(set:$time_delta to $time_increment*2*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) <!-- 2x time--> [[Rerun the batch|rerunsuccess]]You came up with a code fix for the data producer and pushed out the change.(set:$time_delta to $time_increment*(1+$stress/100))(set: $time_elapsed to it+$time_delta)(set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) [[Rerun the batch|rerunfail]]You came up with a code fix.(set:$time_delta to $time_increment*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) How should we release it? [[I like to live dangerously. Rush it!|rushrelease]] [[Write unit tests, run manual tests, and send out a code review.|carefulrelease]] You released the code without reviews or testing. 
#agiledevelopment (set:$time_delta to $time_increment*0.25*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) <!-- this one takes a quarter of the time --> [[Rerun the batch|rerunfail]]You sent out a code review and got a shipit. It took a bit of time but this should work... <!-- this one consumes 3x amount of time --> (set:$time_delta to $time_increment*3*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) [[Rerun the batch|rerunfail]]Congratulations! The batch succeeded and you've closed a P0 case :megahappy: Final stats(set:$time_delta to $time_increment*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) ----- Time Elapsed: (print: $time_elapsed) hours Oncall Stress: (print: $stress)% ----- [[Play again|init]] (link:"Fill out feedback form")[(gotoURL:"https://docs.google.com/forms/d/e/1FAIpQLSest4tngyTeWZ9DBudZLUL5IUsyICTXhduuPellDXnny9LR2A/viewform")]You reran the batch but it still seems to be failing... Let's go back to investigation.(set:$time_delta to $time_increment*0.5*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) [[continue|checkemr]]You asked a senior engineer what to do. (if:($stress-$stress_increment)<=0)[(set:$stress_delta to $stress*(-1))](else:)[(set:$stress_delta to $stress_increment*(-1))](set:$stress to it+$stress_delta)(set:$time_delta to $time_increment*.25*(1+$stress/100))(set: $time_elapsed to it+$time_delta) <!-- asking consumes less time --> <section class="slack" ><div class="channel">#team-something</div ><div><span class="name" style="color: #4d4da2;">you</span>There is something wrong with the input log. 
</div ><div><span class="name" style="color: #4d4da2;">you</span>What should I do about it?</div ><div><span class="name" style="color: #a24d4d;">senioreng</span>You could either fix the input or the batch code.</div ><div><span class="name" style="color: #a24d4d;">senioreng</span>I don't think updating the logging code will do anything</div ></section> [[Clean bad log line|cleanbadlog]] [[Change the batch (data consumer) code|changeconsumer]] [[Change the logging (data producer) code anyway|changeproducer]]<link href="https://fonts.googleapis.com/css?family=Bitter:400,700|Lato|Ubuntu+Mono" rel="stylesheet"><div class="page"><section class="header">OnCall of Duty</section ><section class="content"></section><section class="footer"></section ><div class="stats center">Time: (print: (round:$time_elapsed*100)/100) hrs (+(print: (round:$time_delta*100)/100)) | Stress: (print: (round:$stress*100)/100)% ((if:$stress_delta>=0)[+](print: (round:$stress_delta*100)/100))</div ></div> (if: $stress>100)[(if: $gameover is false)[ (goto: "gameover") ] ] Slack snippet: <section class="slack" ><div><span class="name" style="color: #4d4da2;">helplessdev</span>I have a problem!!!</div ><div><span class="name" style="color: #a24d4d;">opsperson</span>Don't worry, I can help. What seems to be the problem?</div ><div><span class="name" style="color: #4d4da2;">helplessdev</span>Search pages are serving 500 errors and I don't know where they're coming from</div></section> goofy flashing panic message! <div class="flash center">DAR-2152</div> Code snippet: <div class="code" >x = 67; status = `_dfsk___ldfj`; hope += 1;</div> regular button: (click to see back button) [[go to next step|example page]] Fancy back button <div class="back">[[back to style|Style use examples]]</div> ----Game Over----(set: $gameover to true) Your manager has decided to ask someone else to be in charge of this incident. 
Final stats ----- Time Elapsed: (print: $time_elapsed) hours Oncall Stress: (print: $stress)% ----- [[Play again|init]] (link:"Fill out feedback form")[(gotoURL:"https://docs.google.com/forms/d/e/1FAIpQLSest4tngyTeWZ9DBudZLUL5IUsyICTXhduuPellDXnny9LR2A/viewform")]