Branch Branch Revolution

This is something I’ve been meaning to write, but personal circumstances have limited my time recently, and I’m a bit wary of suggesting things I don’t have the time to actively help implement. It feels very relevant given the other recent discussions, though, so I’ll just type it out and it can be a bit unpolished.

Maslow as a project has, to me, hit a point that (successful) software always hits, where the volume of commits increases and you get the classic issues around scaling. The ways these are solved in production software engineering are pretty well established. Interestingly, AI (ignoring the hype) doesn’t actually change any of them - if anything it just shifts left the point at which projects hit this, as the volume of commits balloons. A time pretty much always comes when you have to sit down and come up with a strategy, and it’s been shown time and again that there are three pillars to it actually working.

So, you have three pillars - a Branching Strategy, a Release Strategy, and a Testing Strategy - and they are heavily interconnected. It can be a bit hard to define each without the others, but I’ll just put it out there and hope it makes sense.

Why?

There is one underlying premise that this is all based on, one that has held true… forever:

  • Any commit invalidates all previous testing.

OK, that’s a very strong statement, but as a rule of thumb it holds - neither experienced engineers nor AI can predict all the unexpected consequences, and untested commits - if they do anything useful - will most likely have at least a small unintended consequence, change a corner case, or introduce a regression.

So what does that lead to with our strategies?

Branching Strategy

This is about maintaining development velocity but keeping stability.

Instead of having just main, with everything happening there and releases just being points on main, it’s normal to maintain multiple branches of your software.

Key Points:

  • You have release branches, dev/preview branches and possibly LTS branches.
  • Dev/preview is often main - it’s where the development happens.
  • You branch main / dev at the point you want to prepare a stable release for public consumption.
  • There are a few approaches to the timing of branch creation - a judgement call based on your Release Strategy: how early or late in the release cycle to create it, generally driven by how complex your software is and how long it takes to stabilise versus your release cadence.
  • The fundamental point though, is - Release Branches Only Get Bug Fixes.

What we have at the moment is bug fixes and feature development going into the same branch and the same releases, and it’s killing us in terms of stability.
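
As a sketch of the mechanics in git (branch names and the fix hash here are purely illustrative):

git checkout main                 # dev / preview work happens here
git checkout -b release/vX.Y      # branch at the point you want to stabilise

# features keep landing on main as normal, while the release branch
# only ever receives bug fixes, cherry-picked over from main:
git checkout release/vX.Y
git cherry-pick <hash-of-fix>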

Release Strategy

This is about giving the user what they want and maintaining stability.

Instead of every feature being added as soon as possible, you pick and choose and release as often as you can maintain stability.

Key Points:

  • Some users want the latest and greatest and will trade away stability; most want no bugs, even at the cost of the latest features.
  • It’s easy to assume that every feature should go in ASAP, but every feature costs you in dev, testing and bug fixes.
  • You generally balance those by picking the features you want in a release, developing them, stabilising the branch that contains all of them, releasing it, and then only putting bug fixes in from then on.
  • New / untested features can be great, but the place for them is a preview / dev branch (on a small project, that would often just be main) - that can have nightly releases; this isn’t about slowing development velocity.
  • The cadence at which you branch and release is entirely project-dependent - from hours to days to months, it just depends on your circumstances.
  • The fundamental point is - The Current Release should always be Stable. The first place a user is pointed should be that stable release.

What we have at the moment is releasing from main and always pointing the user to the latest release, which means lots of grumpy users who can’t opt for stability.
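
The nightly / preview side of this can be almost fully automated. A minimal sketch of a scheduled CI job - assuming GitHub Actions and a PlatformIO build like FluidNC’s; every name here is illustrative rather than the project’s actual setup:

# .github/workflows/nightly.yml (hypothetical)
name: nightly-build
on:
  schedule:
    - cron: "0 4 * * *"        # build every night at 04:00 UTC
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0       # full history so git describe can see tags
      - name: Install PlatformIO
        run: pip install platformio
      - name: Build firmware
        run: pio run
      - name: Version the artifact from git describe
        run: echo "VERSION=$(git describe --tags)" >> "$GITHUB_ENV"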

Testing Strategy

This is about how you verify that your software does what you expect, consistently - and not constantly chasing your own tail with bug fixes. It is how you verify that your current release is always stable.

Every feature should have a test, and ideally everything is tested (though realistically that’s something you strive for but never actually get to).

Key Points:

  • Really, the key point is what I put above - any commit invalidates all previous testing.
  • In practical terms, though, you can focus testing around the specific area a commit changes, and use Smoke Tests to trim the testing to manageable levels.
  • Normally, testing looks like Smoke Tests (<5 minutes of automated testing that a dev can usually run locally or poke at easily), Functional Tests / post-checkin tests (these try to cover everything; a dev might be able to run them locally, but they often run on dedicated remote test hardware), and System Level / Soak / Release testing (testing that’s expensive and/or manual, which you only run when you’re preparing to release a branch).
  • I have a few thoughts on how we might kickstart some testing (which I’ll add below), but I also don’t know what we currently do for testing, so it might already be underway.
  • The fundamental point is - Without Good Test Coverage, your Tree is Permanently Broken. You need to be able to run automated testing that gives you high confidence a particular checkin doesn’t break the tree, or all your dev time is lost to chasing your own tail with bug fixes. Sustainable velocity is defined by testing.

What we have at the moment appears to be ad hoc, but I also don’t really know, if I’m honest.

What could it look like?

So, in practical terms, what might we have chosen to do? I think I would have considered:

  • Branch at 1.07, call it the 1.X LTS release, and leave it available to take 1.07.X bug fixes if we want.
    • 1.07 was / is pretty stable, and has value as the last pre-individual-arm-kinematics release. I’ll be honest, I’m still using it on one of my boards because it’s predictable.
  • 1.13 / 1.14 / 1.15 possibly should have been branched at 1.13 as a 2.0 release, then stabilised before release as 2.0, with 2.0.X bug fix releases.
    • I would argue the kinematics change and the anchor-point-finding changes should have been in separate releases as the headline features.
    • That doesn’t stop there being dev / preview nightly builds with both in; it just shouldn’t be what most users find first.
  • 1.15 possibly should have been (could still be) branched as a 2.1 release, with 2.1.X bug fix releases.

Testing thoughts

Possibly this should go in another thread, but these are just my thoughts, without much familiarity with the codebase.

As we’re a state machine - do we have a state-transition diagram anywhere? Do we have tests derived from it?

What we could do is go a step further into something I’ve not seen talked about widely but have seen in person - a state-transition matrix approach.

In essence, you just make a table with the different states the hardware can be in as the rows, and each column is a Thing - an input, an event, everything you can think of. Each table entry is then the expected outcome, and building out your testing becomes a game of bingo - ticking each cell off as you create a test for it. It also forces you, when adding a feature, to look at what the expected behaviours are and to be disciplined about what you add. Realistically it only works if you have a small number of states (really, state-transition diagrams are just sparse matrices of these) - but Maslow feels like about the right scale for it to work well.
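
To make that concrete, here’s a minimal sketch of the matrix driving an automated test. The states, events, and the applyEvent() hook are illustrative stand-ins, not the actual firmware API:

// A state-transition matrix as a test table: every cell becomes one check.
#include <cassert>
#include <cstdio>
#include <map>
#include <utility>

enum class State { Unknown, Retracted, Extended, ReadyToCut };
enum class Event { Retract, Extend, TakeSlack, ReleaseTension };

// The matrix: (current state, input event) -> expected resulting state.
// Cells not listed mean "the event must be rejected / state unchanged".
static const std::map<std::pair<State, Event>, State> matrix = {
    {{State::Unknown,    Event::Retract},        State::Retracted},
    {{State::Retracted,  Event::Extend},         State::Extended},
    {{State::Extended,   Event::TakeSlack},      State::ReadyToCut},
    {{State::ReadyToCut, Event::ReleaseTension}, State::Extended},
};

// Stand-in for the firmware's real transition entry point.
State applyEvent(State current, Event e) {
    auto it = matrix.find({current, e});
    return it == matrix.end() ? current : it->second;
}

int main() {
    // Bingo-card loop: tick off every cell of the matrix.
    for (const auto& [cell, want] : matrix) {
        assert(applyEvent(cell.first, cell.second) == want);
    }
    std::puts("every transition matches the matrix");
}

On real hardware, applyEvent() would drive the machine and read back the resulting state; the table itself stays the same.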


The testing table and stable release flags look like very wise ideas. We need things to work for most users.


Testing is ad hoc; we really need help figuring out improvements.

unfortunately, with real hardware, it's far more complex than simply talking
about states

David Lang

Is there a checklist, state-change flowchart, or grid that we can sketch out for users to run through to test their build before they start cutting for the first time, or when they change something? Even if it's imperfect, it might be good.

States to transition between:

  • Power on
  • Retracted
  • Extending
  • Extended
  • Calibration
  • Tension release
  • Tension on
  • Ready
  • Alarm
  • Jogging while alarm
  • Stopped?
  • Jogging XY
  • Jogging Z
  • Cutting or running gcode
  • Trace outline
  • Reconnect
  • Restart
  • Hard restart
  • Test

A standard protocol response could help with the generous forum troubleshooting.


:heart: :heart: :heart: :heart: :heart: :heart: I love this idea so much, and I 100% agree that this is what we need to do, but I’m not 100% clear on how best to do it. I do think that this ties into the repo cleanup / organization that we’re doing.

It feels like we’re at a bit of a transition point from “we’re missing a bunch of basic features and core things aren’t right, so let’s move fast and add them” to more of a “hey, things are pretty good now; let’s take our time and make sure we aren’t regressing something critical to add a feature that is more of a ‘nice to have’ than a ‘need to have’”.

I think that we have basically been doing an ad hoc version of branching releases, but all in one branch. Basically, we get a good stable version, then we change things and break a bunch of stuff, and then for the next two versions or so we refrain from changing anything other than bug fixes to get back to stable. 1.12 was a good stable release, and I’m hoping that 1.15 is as well.

We need to do a much better job of clearly labeling those and acknowledging what they are.

Our big bottleneck is testing. Because any change truly does have the potential to break everything else, a LOT of testing is needed to make sure that things are good, and that takes hours. I would guess that it takes 20+ hours of testing for me to feel good about a release, which is a lot of time.

I have been posting what are basically testing branches here: Interstitial Firmware Releases - #571 by dlang

but that could be a lot more organized.


wouldchuck wrote:

Is there a checklist, state-change flowchart, or grid that we can sketch out
for users to run through to test their build before they start cutting for the
first time, or when they change something? Even if it's imperfect, it might be
good.

not currently

There is a list of tests that Bar created, but it's something like:
  • do a calibration
  • do a calibration from a new maslow.yaml
  • after calibration, jog around to make sure everything moves correctly

States to transition between:

first off, I know the original poster is not limiting things to these states,
but to start the conversation, I'm starting here.

Second, I will start by talking about the firmware, not the UI.

maslow internally has these states right now:

{ UNKNOWN, "Unknown" },
{ RETRACTING, "Retracting Belts" },
{ RETRACTED, "Belts Retracted" },
{ EXTENDING, "Extending Belts" },
{ EXTENDEDOUT, "Belts Extended" },
{ TAKING_SLACK, "Taking Slack" },
{ CALIBRATION_IN_PROGRESS, "Calibrating" },
{ READY_TO_CUT, "Ready To Cut" },
{ RELEASE_TENSION, "Releasing Tension" },
{ CALIBRATION_COMPUTING, "Calibration Computing" }

look in FluidNC/src/Maslow/Calibration.cpp at the
Calibration::requestStateChange() function to see the code that decides the
rules for changing from state to state and what each state does, and at the
Calibration::home() function, which is heavily overloaded to move belts based
on state.

many of these states are short-term states for a particular action (retracting,
extending, taking slack, calibration in progress, calibration computing)

When you power on, it checks a bunch of stuff, including that the maslow.yaml
is sane. After basic initialization, it checks if there is a valid saved
position. If not, it goes into the 'unknown' state and you have to retract to
get it into the retracted state, at which point you can extend (going through
extending to belts extended), and from there you can either calibrate or take
slack; either one will get you to 'ready to cut'.
If there is a valid saved position, I think it goes into the belts extended
state so you can apply tension (initially it went to ready-to-cut or retracted,
but Bar changed it to require applying tension, and I haven't looked to confirm
exactly which state it goes into)

when you hit 'find anchors', it goes into the calibration state, does some
stuff to find some data points, sends those data points to the browser, and
switches to calibration computing. When the browser finishes the computation,
it sends the results back to the firmware and the state switches back to
calibration. This is a fairly ugly process where information is kept between
one invocation of the calibration state and the next, and when you enter the
calibration state it has to look around and figure out what's happened and what
it's supposed to do next. There isn't a 'start calibration' and a 'continue
with next points' state; it's all just under 'calibration'.

it's in ready-to-cut that all the useful work is done. While you are ready to
cut, you can be idle or not idle. When it transitions to idle, it checks if
its position matches what's saved, and if not, saves its position. As it
transitions out of idle, it marks the saved position as invalid. More on what
can happen here later.

The alarm mode is a legacy from FluidNC. It's the rough equivalent of the
unknown state (i.e. the machine doesn't know where it is), but with a
traditional-style machine you can just hit 'home' and the machine will find
where it is, while with a maslow you have to do the rehang dance. In alarm mode
you cannot jog X/Y or execute a gcode program. We should probably change that
button on the maslow screen to represent the unknown status and bring up the
menu that lets you retract/extend/etc.

trace outline is a UI function: it looks at the loaded gcode, plans an outline,
and sends instructions to the firmware to move around that outline and back to
where it started. To the firmware, it's just a series of jog commands.

FluidNC supports a lot of commands: the gcode commands, the jog commands, a lot
of housekeeping commands, and a bunch of maslow-specific commands. I'll post a
list of those in a separate post.

David Lang


Bar wrote:

I have been posting what are basically testing branches here: Interstitial Firmware Releases - #571 by dlang

the problem is that there is no difference between a testing branch and a
release. When you announced interstitial releases, you made it sound like those
were supposed to be testing releases, but they aren't; any release you make is
a release, with a new version number.

now that we have the builds being named via tags, we can talk about doing this
again.

apologies that this is long and technical; it covers a lot of ground and I
don't want to take shortcuts. It seems complicated, but in practice it's not,
and it will reduce the unintended bugs from new features enough to save
considerably more time than it takes (and large amounts of it can be automated)

*** Defining version numbers ***
v1.15 is a release
v1.15.2 is the 2nd bugfix of v1.15
v1.15-5-gadba is a test build (5 commits past v1.15)
v1.16-rc2 is a release candidate for 1.16
v1.15.3-rc1 is a release candidate for v1.15.3

v1.15-5-gadba is a build from the main branch at that point in time (we should
do nightly builds and let people follow them). It is the 5th commit past v1.15
and has a git hash starting with adba; I will show this type of version as
v1.15-5-g for simplicity going forward.

*** The workflow would be ***

hack-hack-hack, the main branch changes. Each commit that is added changes the
version: v1.15 becomes v1.15-1-g with the next commit, then v1.15-2-g, …

Each night a build is made, and this is the interstitial or 'nightly' release
for those brave souls interested in testing the latest and greatest.

hack-hack-hack, your tree is now at v1.15-75-g (75 commits since v1.15)

you think things are good and it's time to make a release: you make a branch
'v1.16' and tag the tree as v1.16-rc1. It gets built and released as a release
candidate build. If nobody reports problems with testing, you tag it as v1.16.

since development doesn't ever completely stop (and we are optimistic that the
work we've done is correct and no fixes are going to be needed), while this is
happening additional stuff happens in the main branch, and v1.15-78-g is newer
than v1.16-rc1

someone finds a bug in v1.16-rc1, you make a fix that is v1.15-79-g and test it.

you then backport that fix (cherry-pick that commit) to the v1.16 branch and
tag it v1.16-rc2 (technically this invalidates all the testing that was done on
rc1; in practice, if it's a small fix, not all testing is re-done, but as much
as practical should be)

when you tag v1.16, you rebase main on it (since every fix that went into a
v1.16-rc was first in your main branch, this should be trivial, but there may
be some trivial merge conflicts to resolve if development hasn't slowed enough)

hack-hack-hack, you are now at v1.16-10-g and someone discovers a bug in v1.16,
so you fix the bug and it is v1.16-11-g in the main branch. Your fix isn't
perfect, so you then have v1.16-12-g as the version with the right fix.

you don't want to release the main tree as v1.17 (which is what we've done up
until now), because it also has 10 unrelated commits that could cause problems.

so you go to the v1.16 branch, cherry-pick the v1.16-11-g and v1.16-12-g
commits, and tag the result v1.16.1-rc1 (otherwise it would show up as
v1.16-2-g). People test it, confirm it works, and you then tag it as v1.16.1.

hack-hack-hack you are now at v1.16-50-g and it’s time to make a release (see
above)
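
the whole cycle above boils down to a handful of git commands; this is a
sketch, with branch names and hashes illustrative (I've used a release/ prefix
so the branch name doesn't collide with the v1.16 tag):

# release time: branch from main and cut a release candidate
git checkout -b release/v1.16 main
git tag v1.16-rc1                      # built and published as an RC

# an RC bug gets fixed on main first, then backported
git checkout release/v1.16
git cherry-pick <hash-of-fix>          # the fix that was v1.15-79-g on main
git tag v1.16-rc2                      # re-test as much as practical

# testing is clean: bless the release and resync main
git tag v1.16
git checkout main
git rebase v1.16                       # trivial, since the fixes came via main

# later: bugs found in released v1.16 become a point release
git checkout release/v1.16
git cherry-pick <hash-of-fix-1> <hash-of-fix-2>
git tag v1.16.1-rc1                    # then v1.16.1 once confirmed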

Note: I am of the 'release early, release often' camp; we should not get to
50-75 commits in a release. It should be more a matter of "it's been X weeks
since the last release, time for the next one" than "there are lots of new
features, we need to get them out". The fact that v1.13 had close to 200
commits shows we should have released a lot sooner.

more frequent releases mean that there is less to test in each release, and
while each commit invalidates all prior testing, testing can be concentrated
on the areas we have changed.

people can enable automated updates and get the latest stable, nightly, or
rc + stable releases, depending on what level of risk they are willing to live
with.


is the workflow post above clear enough as-is? Or do I need to re-post it cleaned up with markdown instead of as an email reply?


Proper reply in the morning, but a couple of points:

@dlang 's explanation above is what it generally looks like; I don’t have much to add right this second.

This was the starting point I envisioned too. Something else I’ve seen/used in the past that might help us is a simple command that can dump everything that can be considered ‘state’ (in the broader sense - this internal state, but also all the values from the yaml, etc.) onto the command line / log. It’ll be a lot, but it has proved surprisingly useful to be able to check it in a test, diff it while debugging, etc.
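
Something like this - completely hypothetical, and the real thing would need to hook into FluidNC’s command registration - just to show the shape of it:

// Hypothetical $DumpState command: print every piece of machine state
// in a stable key=value order so test scripts can capture and diff it.
#include <cstdio>

struct MachineState {            // stand-ins for the real firmware state
    const char* maslowState;     // e.g. "Ready To Cut"
    bool        idle;
    float       beltLen[4];      // TL, TR, BL, BR
    float       x, y, z;
};

void dumpState(const MachineState& s) {
    std::printf("state=%s\n", s.maslowState);
    std::printf("idle=%d\n", s.idle);
    for (int i = 0; i < 4; ++i)
        std::printf("belt[%d]=%.3f\n", i, s.beltLen[i]);
    std::printf("pos=%.3f,%.3f,%.3f\n", s.x, s.y, s.z);
    // ...plus every maslow.yaml value, nvram field, etc. - one line each,
    // always in the same order, so two dumps can be diffed line by line.
}

int main() {                     // demo invocation with made-up values
    dumpState({"Ready To Cut", true, {1000, 1000, 1200, 1200}, 0, 0, 10});
}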

Dave wrote:

This was the starting point I envisioned too. Something else I've seen/used in
the past that might help us is a simple command that can dump everything that
can be considered 'state' (in the broader sense - this internal state, but
also all the values from the yaml, etc.) onto the command line / log. It'll be
a lot, but it has proved surprisingly useful to be able to check it in a test,
diff it while debugging, etc.

there are commands to return files, or specific fields in a file; what file
you want may depend on what is set in other files. There is also nvram; I'm
not sure if you can just dump that or if you have to query it field by field.
There is a command to dump the belt positions. Telnet to your maslow and send
the command $CMD and it will output all the commands available; there are a
lot.

but what level of detail are you looking for? The state of TCP connections is
not going to be useful, for example.

when you are running find anchors there is a LOT of state, even when the
machine is paused, and trying to dump data while running can cause grief (this
is why the telemetry that shows motor currents in real time is disabled by
default)

David Lang

What does the computer call the state when it is cutting?

Does this look accurate? Suggestions? I missed the home button.

Also, a picture of where I was working, for entertainment value.


I’d have to go over it with a fine-tooth comb to be sure, but it looks pretty darn good to me!

I also asked the AI to come up with one from looking at the code to check against and here is what it came up with:


for the maslow code, it's 'ready to cut'; the idle / not idle flag is separate

David Lang


The diagrams look similar. Man vs machine! When would idle / not idle be used?

take a look at http://blockdiag.com/ as a tool for doing flowcharts for online
use.

I see a number of things here:
  • Z jogging is not limited to the ready-to-cut or alarm states
  • it doesn't wait until you connect to do a bunch of stuff (like checking
    for saved state, connecting to wifi, auto-update)

David Lang

wouldchuck wrote:

When would idle / not idle be used?

when the FluidNC engine is moving the machine in response to commands, it is
not idle. When it finishes all pending movement-type commands (be they jog or
other gcode), it transitions to the idle state.

David Lang


There are really two separate state machines. The states like Alarm, Jog, and Idle are all FluidNC states.

The states like Retracted, Extended, etc. are Maslow states.

Is the machine trying to connect to the greater web and update itself now? Or does it just check, to tell the user that there is an update? It feels like there should be a very clear option for people to choose whether the machine is local or globally connected.

wouldchuck wrote:

Is the machine trying to connect to the greater web and update itself now?
Or does it just check, to tell the user that there is an update? It feels like
there should be a very clear option for people to choose whether the machine
is local or globally connected.

by default it does not update, but if you connect to an AP (as opposed to
hosting the network on the maslow), it will check if there is a new version
and put it in the log (v1.13 had a popup to tell you there was a new version).

in the config, you can tell it to auto-update, but it is disabled by default.

David Lang
