[hibernate-dev] Jenkins job priorities

Discussion:

Yoann Rodiere

2018-01-10 10:06:09 UTC

Hello,

TL;DR: I installed a plugin to prioritize Jenkins jobs, please let me know
if you notice anything wrong. Also, I will remove the Heavy Job plugin
soon, let me know if you're not okay with that.

I recently raised the issue on HipChat that some Jenkins builds are
triggered in batches, something like 4 or 5 at a time. Since builds are
executed in the order they are requested, this forces the next requested
builds to wait for more than one hour before being executed, regardless of
their urgency.
One example of such a batch is whenever something is pushed to Hibernate ORM
master (or Search master, probably): one build is triggered for tests
against H2, another for tests against PostgreSQL, another for tests against
MariaDB, and so on.

It turns out there is a solution for this problem: the PrioritySorter
plugin. I installed the plugin on CI and configured it to give higher
priority to the following builds:

- Builds triggered by users (highest priority)
- Release builds (builds in the "Release" view)
- Website builds (builds in the "Website" view)
- PR builds (builds in the "PR" view)

In practice, such builds will be moved to the front of the queue whenever
they are triggered, resulting in reduced waiting times.
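
To illustrate the effect, here is a minimal sketch of a priority-ordered
build queue. It is not the PrioritySorter plugin's implementation or
configuration: the numeric priorities, the job names and the BuildQueue class
are all hypothetical, and the relative order of the Release, Website and PR
groups is assumed to follow the order of the list above.

```python
import heapq
from itertools import count

# Hypothetical priority values: a lower number is served first.
PRIORITY = {"user": 1, "Release": 2, "Website": 3, "PR": 4}
DEFAULT_PRIORITY = 5  # everything else, e.g. the per-database ORM builds


class BuildQueue:
    """Toy build queue: highest priority first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker that preserves submission order

    def submit(self, name, category=None):
        priority = PRIORITY.get(category, DEFAULT_PRIORITY)
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def next_build(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = BuildQueue()
# A batch triggered by a push to ORM master (hypothetical job names).
for db in ("h2", "postgresql", "mariadb"):
    queue.submit(f"orm-master-{db}")
# Jobs requested afterwards, but with a higher priority.
queue.submit("search-pr", category="PR")
queue.submit("validator-release", category="Release")

order = [queue.next_build() for _ in range(len(queue))]
print(order)
```

With a plain first-in, first-out queue the release job would run last; here
it runs first, followed by the PR job, and the batch keeps its original order.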

I hope we will be able to use this priority feature instead of the Heavy
Job plugin (which allows assigning weights to jobs), and avoid concurrent
builds completely. With the current setup, someone releasing their
project will only have to wait for the currently executing build to finish,
and will get the highest priority on the release builds. Maybe this is
enough? If you disagree, please raise your concerns now: I will disable the
Heavy Job plugin soon and set each slave to only offer one execution slot.

Please let me know if you notice anything wrong. I tested the plugin on a
local Jenkins instance, but who knows...

Yoann

--
Yoann Rodiere
***@hibernate.org / ***@redhat.com
Software Engineer
Hibernate NoORM team

Guillaume Smet

2018-01-10 10:25:57 UTC

Hi,

Post by Yoann Rodiere
I hope we will be able to use this priority feature instead of the Heavy
Job plugin (which allows assigning weights to jobs), and avoid concurrent
builds completely. With the current setup, someone releasing their
project will only have to wait for the currently executing build to finish,
and will get the highest priority on the release builds. Maybe this is
enough? If you disagree, please raise your concerns now: I will disable the
Heavy Job plugin soon and set each slave to only offer one execution slot.

I'm not really convinced by this solution. Some jobs still take quite a lot
of time and having to wait 20 minutes for each job I would trigger is a bit
annoying.

If it were for only one job, it would be acceptable, but let's take the
worst case of a coordinated HV release:
- TCK release
- API release
- HV release
- website
- blog

I won't have to wait for each of them as some of them will be grouped by
the prioritization but I'm pretty sure I will have to wait for several of
them.

So, I'm +1 on having this plugin as it seems to be helpful on its own but
I'm -1 on considering it a solution to the "let's roll a release" thing.

--
Guillaume

Sanne Grinovero

2018-01-10 11:08:35 UTC

Post by Guillaume Smet
Hi,

Thanks Yoann! That sounds great.

Post by Guillaume Smet
I'm not really convinced by this solution. Some jobs still take quite a lot
of time and having to wait 20 minutes for each job I would trigger is a bit
annoying.
If it were for only one job, it would be acceptable, but let's take the
worst case of a coordinated HV release:
- TCK release
- API release
- HV release
- website
- blog
I won't have to wait for each of them as some of them will be grouped by
the prioritization but I'm pretty sure I will have to wait for several of
them.
So, I'm +1 on having this plugin as it seems to be helpful on its own but
I'm -1 on considering it a solution to the "let's roll a release" thing.

Some of our test suites used to take 2 hours to run (even 5 days some
years ago); now you say waiting 20 minutes is not good enough? You'll
have to optimise our code better :P

It's very easy to spin up extra nodes; my recommendation is that when
you know you're about to release [for example approximately one hour
in advance while you might be double-checking JIRA state and such
things] hit that manual scale-up button and have CI "warmed up" with
one or two extra nodes.

By the time you need to trigger the release job you'll have the build
queue flushed, the priority plugin helping you out, and still
additional extra slaves running to run it all in parallel.

And of course for many releases we don't care about an extra 30 minutes,
so you're free to skip all this if it's not important. Incidentally,
for "work in progress" milestones like the module packs, which we
recently re-released several times while polishing up the PR, I've been
releasing from my local machine. It's good to have CI automate things
but I don't think we should get in a position to require 100%
availability from CI: practice releases locally sometimes.

If we really wanted to invest more in it (both time and budget),
there's the option of spinning up new containers for each job as soon
as you need one, but I feel like we've spent too much time on CI
already. Such technology is maturing, so my take is to let it mature a bit
more, and in 6 months we'll take another step of improvement; jumping on
those things now would make us the beta testers and steal critical
time from our own projects.
Let's not forget that many Apache projects take a week or two to
perform a release, and we all know of other projects needing months, so by
the law of diminishing returns I don't think we should invest much
more of our time to shave off those 10 minutes... just spin up some extra
nodes :)

Thanks,
Sanne

_______________________________________________
hibernate-dev mailing list
https://lists.jboss.org/mailman/listinfo/hibernate-dev

Davide D'Alto

2018-01-10 11:15:34 UTC

Post by Sanne Grinovero
Let's not forget that many Apache projects take a week or two to
perform a release, we all know of other projects needing months, so by
the law of diminishing returns I don't think we should invest much
more of our time to shave off those 10 minutes... just spin up some extra
nodes :)

Guillaume Smet

2018-01-10 11:33:10 UTC

Post by Sanne Grinovero
Some of our test suites used to take 2 hours to run (even 5 days some
years ago); now you say waiting 20 minutes is not good enough? You'll
have to optimise our code better :P

What I'm saying is that in the current setup, I don't wait at all when I
have something to release.

Everything runs in parallel with the currently running jobs.

And it works well.

Post by Sanne Grinovero
It's very easy to spin up extra nodes; my recommendation is that when
you know you're about to release [for example approximately one hour
in advance while you might be double-checking JIRA state and such
things] hit that manual scale-up button and have CI "warmed up" with
one or two extra nodes.
By the time you need to trigger the release job you'll have the build
queue flushed, the priority plugin helping you out, and still
additional extra slaves running to run it all in parallel.
And of course for many releases we don't care for an extra 30 minutes
so you're free to skip this all if it's not important; incidentally
for "work in progress" milestones like the module packs which we
recently re-released several times while polishing up the PR I've been
releasing from my local machine; it's good to have CI automate things
but I don't think we should get in a position to require 100%
availability from CI: practice releases locally sometimes.

Well, the ultimate goal of releasing on CI is to have consistent releases
and an automated process.

I really don't want to build a release locally and be at risk of doing
something wrong.

That's the main purpose of an automated process and having a stable machine
doing it.

What I'm saying is that the current setup is working very well for releases
and the proposed setup won't work as well.

You can find all sorts of workarounds but it won't work as well and be as
practical as it used to be. Yeah, you can think of starting another node 1
hour before doing your release and hope it will still be there and you
won't have another project's commit triggering 4 jobs just before you
start. Sure. But I'm pretty sure it's going to be a pain.

I'm probably the one doing releases the most frequently with HV, that's why
I am vocal about it.

And maybe I'm the only one but, when I'm working on a release, I don't like
to do stuff in parallel because I don't want to forget something or make a
mistake. So I'm fully focused on it. Waiting 20 minutes before having my
job running will be a complete waste of time. And if it has to happen more
than once during a given release, I can predict I will get grumpy :).

That being said, if you have nothing against me cancelling the running jobs
because they are in the way, we can do that. But I'm not sure people will
like it very much.

--
Guillaume

Sanne Grinovero

2018-01-10 15:50:15 UTC

Post by Guillaume Smet

What I'm saying is that in the current setup, I don't wait at all when I
have something to release.
All is passed in parallel to the currently running jobs.
And it works well.

I'm confused now. AFAIK this has never been the case? I understand
that the release process itself runs without running the tests, but
I'd still run the tests by triggering a full build before.
You made the example of the TCK and various tests; you'd not be
allowed to run them in parallel with other builds, so if you
wanted to release and the jobs happened to be building ORM against all
its RDBMSs, you'd have had to wait for a couple of hours.

Post by Guillaume Smet

Well, the ultimate goal of releasing on CI is to have consistent releases
and an automated process.
I really don't want to build a release locally and be at risk of doing
something wrong.
That's the main purpose of an automated process and having a stable machine
doing it.

Still, I don't really understand whether you have a better idea. In a
nutshell these jobs need resources; if they are busy you either add
more resources, or change priorities, or you wait. Those are the three
aspects you can play with "safely".

Then there's the option of playing with parallelism, but it's really
dangerous: it risks both failing your release and causing failures in
the other tests which are hard to explain, cause confusion among us
all, and ultimately lead to having to repeat all the jobs involved,
unnecessarily consuming more resources and time.
In many cases parallelism isn't even an option: for example, the ORM
builds consume most of the system memory so you just can't run additional
JVMs to run the TCK or similar jobs; if it was safe, I would be using
smaller machines.

Post by Guillaume Smet
I'm probably the one doing releases the most frequently with HV, that's why
I am vocal about it.
And maybe I'm the only one but, when I'm working on a release, I don't like
to do stuff in parallel because I don't want to forget something or make a
mistake. So I'm fully focused on it. Waiting 20 minutes before having my job
running will be a complete waste of time. And if it has to happen more than
one time on a given release time, I can predict I will get grumpy :).
That being said, if you have nothing against me cancelling the running jobs
because they are in the way, we can do that. But I'm not sure people will
like it very much.

Just make sure you ask for permission, but yeah, we've done that
previously; hopefully it won't be needed often, but it's always an
option.

Steve Ebersole

2018-01-10 16:00:01 UTC

And in advance I say I would not be cool with you killing my jobs for your
job to run

Guillaume Smet

2018-01-10 16:28:28 UTC

Post by Steve Ebersole
And in advance I say I would not be cool with you killing my jobs for your
job to run

Yeah, that was my understanding.

I don't expect anyone to be cool with it.

Steve Ebersole

2018-01-10 16:40:18 UTC

I know ;)

Anyway I do agree that any release jobs should be given the highest
priority in the job queue

Yoann Rodiere

2018-01-12 08:12:05 UTC

Quick update: the priority plugin seems to be working fine, and I disabled
the Heavy Job plugin. It turns out the Heavy Job plugin was preventing the
Amazon EC2 plugin from spinning up new slaves, probably because the Amazon
EC2 plugin only saw two empty slots on an existing slave and couldn't
understand that the waiting jobs couldn't be run with only two slots.
Now that it is disabled, the Amazon EC2 plugin spins up lots of instances,
with a limit of 5. In order to avoid a big hit on the budget, Sanne reduced
the idle timeout to 30 minutes. Please allow 2 minutes for the slave to boot
if there is no slave up when you start your job.

So now we have a working Amazon EC2 plugin, ensuring new slaves will be spun
up if there are waiting jobs, and a priority queue, ensuring release/PR
jobs will be run first in the (hopefully unlikely) event a lot of jobs are
waiting in the queue.
It looks like a reasonable setup, so let's see how it goes for the next
releases and discuss it afterwards.
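
The suspected interaction between the two plugins can be sketched as follows.
This is only an illustration of the guess made above, not the actual logic of
the Amazon EC2 plugin or the Heavy Job plugin: both functions and the job
weights are hypothetical.

```python
def naive_needs_new_node(free_slots, queued_weights):
    """Provision only when jobs are waiting and no slot at all is free.

    Job weights are ignored, which is the behaviour suspected above.
    """
    return bool(queued_weights) and free_slots == 0


def weight_aware_needs_new_node(free_slots, queued_weights):
    """Provision when at least one waiting job cannot fit in the free slots."""
    return any(weight > free_slots for weight in queued_weights)


free_slots = 2           # two empty slots on the existing slave, as described
queued_weights = [4, 4]  # hypothetical weights of the waiting jobs

naive = naive_needs_new_node(free_slots, queued_weights)
aware = weight_aware_needs_new_node(free_slots, queued_weights)
print(naive, aware)
```

Under these assumptions the naive check sees free slots and never asks for a
new instance, so the weighted jobs keep waiting. With one slot per slave and
no weights, the two checks agree.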

Davide D'Alto

2018-01-12 11:38:15 UTC

Well done, thanks a lot.

Guillaume Smet

2018-01-10 16:33:19 UTC

Post by Sanne Grinovero
I'm confused now. AFAIK this has never been the case? I understand
that the release process itself runs without running the tests, but
I'd still run the tests by triggering a full build before.
You made the example of the TCK and various tests; you'd not be
allowed to run them in parallel with other builds, so if you
wanted to release and the jobs happened to be building ORM against all
its RDBMSs, you'd have had to wait for a couple of hours.

When I start my release process, all my test jobs are green. That's the
precondition.

I usually don't commit something in haste just before the release.

When I start my release process, my release job has a weight of 2 so it
runs in parallel with the other jobs (be it ORM, Search, or even BV/HV, as
the release job pushes a commit so builds are triggered).

That's why I like this weight plugin.

And yes, this works because the release jobs don't run the tests so I'm
sure there's no conflict of resources with another job.
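
As a rough sketch of how weights let a light release job run next to a heavy
test job: only the release weight of 2 comes from the description above. The
number of slots per node, the weight of the test jobs and the Node class are
hypothetical, and this is not the Heavy Job plugin's code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy build node whose executor slots are consumed by job weights."""

    total_slots: int
    running: dict = field(default_factory=dict)  # job name -> weight

    @property
    def free_slots(self):
        return self.total_slots - sum(self.running.values())

    def try_start(self, job, weight):
        """Start the job only if its weight fits in the free slots."""
        if weight > self.free_slots:
            return False
        self.running[job] = weight
        return True


node = Node(total_slots=6)  # hypothetical slot count
# Hypothetical weight for a test job that needs most of the machine.
started_tests = node.try_start("orm-master-h2", weight=4)
# Release job with a weight of 2, as described above.
started_release = node.try_start("hv-release", weight=2)
# A second heavy test job no longer fits and has to wait.
started_more_tests = node.try_start("orm-master-postgresql", weight=4)

print(started_tests, started_release, started_more_tests, node.free_slots)
```

The release job starts immediately alongside the running test job, while a
second test job has to wait, which is the behaviour described here.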

Post by Sanne Grinovero
Still I don't really understand if you're having a better idea. In a
nutshell these jobs need resources, if they are busy you either add
more resources, or change priorities, or you wait. That's the three
aspects you can play with "safely".

As explained above, there's no conflict of resources in the case of the
current release jobs: they don't run tests.

That's why it works.

--
Guillaume