<?xml version="1.0" encoding="utf-8" ?>
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xml:base="http://reinout.vanrees.org/" xml:lang="en">
<link rel="self"
href="http://reinout.vanrees.org/weblog/atom.xml" />
<link href="http://reinout.vanrees.org/weblog/"
rel="alternate" type="text/html" />
<div xmlns="http://www.w3.org/1999/xhtml">
<a href="http://www.atomenabled.org/feedvalidator/check.cgi?url=http%3A%2F%2Freinout.vanrees.org%2Fweblog%2Fatom.xml">
<img title="Validate my Atom feed" width="88"
height="31"
src="http://www.atomenabled.org/feedvalidator/images/valid-atom.png"
alt="[Valid Atom]" border="0px" />
</a>
<p>
<span>
This is an Atom formatted XML site feed. It is intended to be viewed in
a Newsreader or syndicated to another site. Please visit
</span>
<a href="http://www.atomenabled.org/">Atom Enabled</a>
<span>
for more info.
</span>
</p>
</div>
<title type="html">Reinout van Rees' weblog</title>
<subtitle>Python, grok, books, history, faith, etc.</subtitle>
<updated>2009-04-04T21:44:00+01:00</updated>
<id>urn:syndication:a55644db8591c020bd38852775819a9a</id>
<entry>
<title>Amersfoort (NL) python meetup</title>
<link rel="alternate" type="text/html"
href="http://reinout.vanrees.org/weblog/2023/11/16/python-amersfoort.html" />
<id>http://reinout.vanrees.org/weblog/2023/11/16/python-amersfoort.html</id>
<author>
<name>Reinout van Rees</name>
</author>
<published>2023-11-16T00:00:00+01:00</published>
<updated>2023-11-16T20:28:00+01:00</updated>
<category term="python" />
<category term="pun" />
<content type="html"><![CDATA[
<div class="document">
<p>The first "pyutrecht" <a class="reference external" href="https://www.meetup.com/pyutrecht/events/296641878/">meetup</a> in Amersfoort in the
Netherlands. (Amersfoort is not the city of Utrecht, but it is in the
similarly named <em>province</em> of Utrecht).</p>
<p>I gave a talk myself about being more of a proper programmer to your own
laptop setup. Have a git repo with a <tt class="docutils literal">README</tt> explaining which programs you
installed. An install script or makefile for installing certain
tools. "Dotfiles" for storing your config in git. Etc. I haven't made a
summary of my own talk. Here are the other three:</p>
<div class="section" id="an-introduction-to-web-scraping-william-lacerda">
<h1>An introduction to web scraping - William Lacerda</h1>
<p>William works at <a class="reference external" href="https://www.deliverect.com/">deliverect</a>, the host of the
meeting. Webscraping means extracting data from a website and parsing it into
a more useful format. Like translating a list of restaurants on a</p>
<p>There's a difference with <em>web crawling</em>: that is following links and trying
to download all the pages on a website.</p>
<p>Important: <tt class="docutils literal">robots.txt</tt>. As a crawler or scraper you're supposed to read it
as it tells you which user agents are allowed and which areas of the website
are off-limits (or not useful).</p>
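<p>A minimal sketch of such a check, using <tt class="docutils literal">urllib.robotparser</tt> from python's
standard library. The robots.txt content, the user agent names and the
example.com URLs are made up for illustration; a real scraper would first
download the file from the site it wants to visit.</p>
<pre class="literal-block">
```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /

User-agent: badbot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# A generic scraper may read the restaurant list, but not the admin area.
print(parser.can_fetch("my-scraper", "https://example.com/restaurants/"))
print(parser.can_fetch("my-scraper", "https://example.com/admin/users"))
# The user agent "badbot" is refused everywhere.
print(parser.can_fetch("badbot", "https://example.com/restaurants/"))
```
</pre>
<p>With these rules the first check is allowed and the other two are refused.</p>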
<p>Another useful file that is often available: <tt class="docutils literal">/sitemap.xml</tt>. A list of URLs
in the site that the site thinks are useful for scraping or crawling.</p>
<p>A <strong>handy trick</strong>: looking at the network tab when browsing the website. Are
there any internal APIs that the javascript frontend uses to populate the
page? Sometimes they are blocked from easy scraping or they're difficult to
access due to creative headers or authentication or cookies or session IDs.</p>
<p>A tip: <a class="reference external" href="https://pypi.org/project/beautifulsoup4/">beautifulsoup</a>, a python
library for extracting neat, structured content from an otherwise messy html
page.</p>
<p><a class="reference external" href="https://www.selenium.dev/">selenium</a> is an alternative as it behaves much
more like a regular webbrowser. So you can "click" a "next" button a couple of
times in order to get a full list of items. Because selenium behaves like a
real webbrowser, things like cookies and IDs in query parameters and headers
just work. That makes it easier to work around many kinds of basic protection.</p>
</div>
<div class="section" id="micropython-wouter-van-ooijen">
<h1>MicroPython - Wouter van Ooijen</h1>
<p>A microcontroller is a combination of cpu, memory and some interfaces to
external ports. <a class="reference external" href="https://micropython.org">https://micropython.org</a> is a version of python for such
low-power devices.</p>
<p>He demoed python's prompt running on a <em>raspberrypi micro</em> connected via
microUSB. And of course the mandatory lets-blink-the-onboard-LED programs. And
then some other demos with more leds and servos. Nice.</p>
<p>A big advantage of micropython is that it doesn't care what processor you
have. With C/C++ you specifically have to compile for the right kind of
processor. With micropython you can just run your code anywhere.</p>
<p>You can use micropython in three ways:</p>
<ul class="simple">
<li>As .py sources, uploaded to the microcontroller.</li>
<li>As pre-compiled <tt class="docutils literal">.mpy</tt> code, also uploaded.</li>
<li>As frozen <tt class="docutils literal">.mpy</tt> included in the images</li>
</ul>
<p>He showed a couple of possible target microcontrollers, for example the Pi
RP2040, ESP32 and Teensy 4.1. A note to myself about the <tt class="docutils literal">ESP8266</tt>: limited
support, use <tt class="docutils literal">.mpy</tt>. I think I have a few of those at home for
should-test-it-at-some-time :-)</p>
<p>A problem: RAM is scarce in such chips and python is hungry... You can do some
tricks like on-demand loading. Watch out when using an LCD graphic display,
that takes 150kb easily.</p>
<p>You have to watch out for the timing requirements of what you want to
do. Steering a servo is fine, but "neopixel" leds for instance need a higher
frequency of signals than micropython is capable of on such a
microcontroller. If you use a C library for it, it works (he showed a demo).</p>
</div>
<div class="section" id="graphql-in-python-meet-strawberry-erik-wrede">
<h1>GraphQL in python? meet strawberry - Erik Wrede</h1>
<p>Erik works as maintainer on the Graphene and the strawberry-GraphQL projects.</p>
<p>Graphql is a query language for APIs. It is an alternative to the well-known
REST method. With REST you often have to do multiple requests to get all the
data you need. And the answers will often give more information than you
actually need.</p>
<p>With graphql, you always start with a <em>graphql schema</em>. You can compare it a
bit to an <em>openapi</em> document. The graphql schema specifies what you can
request ("a Meetup has a name, description, list of talks, etc").</p>
<p>An actual query specifies what you want to get back as response. You can omit
fields from the schema that you don't need. If you don't need "description",
you leave it out. If you want to dive deeper into certain objects, you specify
their fields.</p>
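<p>The idea of selecting fields can be illustrated in plain python. This is
<em>not</em> GraphQL or strawberry code: the <tt class="docutils literal">select</tt> helper, the query format and
the data are hypothetical and only mimic how a query picks the fields it wants
and dives into nested objects.</p>
<pre class="literal-block">
```python
# Hypothetical data, shaped like the "Meetup" example above.
meetup = {
    "name": "pyutrecht",
    "description": "Python meetup in the province of Utrecht",
    "talks": [
        {"title": "MicroPython", "speaker": "Wouter van Ooijen"},
        {"title": "An introduction to web scraping", "speaker": "William Lacerda"},
    ],
}


def select(value, selection):
    """Keep only the requested fields; a nested dict selects sub-fields."""
    if isinstance(value, list):
        # Apply the same selection to every item in a list.
        return [select(item, selection) for item in value]
    result = {}
    for field, sub_selection in selection.items():
        if sub_selection is True:
            result[field] = value[field]
        else:
            result[field] = select(value[field], sub_selection)
    return result


# "description" and "speaker" are left out because we did not ask for them.
query = {"name": True, "talks": {"title": True}}
print(select(meetup, query))
```
</pre>
<p>A real GraphQL server does the same kind of selection, but driven by the
schema and a proper query language.</p>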
<p><a class="reference external" href="https://strawberry.rocks/">Strawberry</a> is a graphql framework. It has
integrations for django, sqlalchemy, pydantic and more. The schemas are
defined with classes annotated with <tt class="docutils literal">@strawberry.type</tt> and fields with
python type hints. (It looked neat!)</p>
<p>He showed a live demo, including the browser-based query interface bundled
with graphql.</p>
<p>Note: strawberry is the more modern project (type hints and so on) and will later
have all the functionality of graphene. So if strawberry's functionality is
enough, you should use that one.</p>
</div>
</div>
]]>
</content>
</entry>
<entry>
<title>aiGrunn: be a better developer with AI - Henry Bol</title>
<link rel="alternate" type="text/html"
href="http://reinout.vanrees.org/weblog/2023/11/10/9-better-developer.html" />
<id>http://reinout.vanrees.org/weblog/2023/11/10/9-better-developer.html</id>
<author>
<name>Reinout van Rees</name>
</author>
<published>2023-11-10T00:00:00+01:00</published>
<updated>2023-11-10T15:34:00+01:00</updated>
<category term="aigrunn" />
<category term="ai" />
<category term="python" />
<content type="html"><![CDATA[
<div class="document">
<p>(One of <a class="reference external" href="https://reinout.vanrees.org/weblog/tags/aigrunn.html">my summaries</a> of
the 2023 Dutch <a class="reference external" href="https://aigrunn.org/">aiGrunn</a> AI conference in Groningen, NL).</p>
<p>"Everybody" uses stackoverflow. Now lots of people use chatgpt (or chatgpt
plus). Stackoverflow traffic has dropped by 50% in the last 1.5 years. So
chatgpt can be your coding buddy.</p>
<p>He really likes it for quickly getting something working (MVP). Like writing
something that talks to a magento API (a webshop system). It would take him
ages to figure it all out. Or he could ask chatgpt.</p>
<p>He also thinks you don't need docstrings anymore: you can just ask chatgpt to
explain a snippet of code for you. (<em>Something I myself don't agree with,
btw</em>).</p>
<p>(He demoed some chatgpt code generation of a sample website). What he learned:</p>
<ul class="simple">
<li>Good briefing and interaction is key. First tell it what you want before you
start to code.</li>
<li>Chatgpt sometimes loses track if the interaction goes on for too long.</li>
<li>Read what it gives you, otherwise you won't know what it built for you.</li>
<li>Watch out for the "cut-off time" of the chatgpt training set: perhaps newer
versions of libraries don't work anymore with the generated code.</li>
</ul>
<p>Some dangers:</p>
<ul class="simple">
<li>You get lazy.</li>
<li>You can get frustrated if you don't understand what has been generated for
you.</li>
</ul>
</div>
]]>
</content>
</entry>
<entry>
<title>aiGrunn: testing chatgpt checkers - Arend Top &amp; Rix Groenboom</title>
<link rel="alternate" type="text/html"
href="http://reinout.vanrees.org/weblog/2023/11/10/8-testing-checkers.html" />
<id>http://reinout.vanrees.org/weblog/2023/11/10/8-testing-checkers.html</id>
<author>
<name>Reinout van Rees</name>
</author>
<published>2023-11-10T00:00:00+01:00</published>
<updated>2023-11-10T14:50:00+01:00</updated>
<category term="aigrunn" />
<category term="ai" />
<content type="html"><![CDATA[
<div class="document">
<p>(One of <a class="reference external" href="https://reinout.vanrees.org/weblog/tags/aigrunn.html">my summaries</a> of
the 2023 Dutch <a class="reference external" href="https://aigrunn.org/">aiGrunn</a> AI conference in Groningen, NL).</p>
<p>The world of education was a bit shocked by chatgpt. The institution they work
for advises being a bit careful, but allows it. <strong>But</strong> you're not allowed to
let chatgpt write parts of your official thesis, just like you're not allowed
to let a family member write it. Chatgpt usage can be treated as fraud.</p>
<p>Well, which tools can be used to search for possible fraud?</p>
<ul class="simple">
<li>GPT-2 output detector</li>
<li>Copyleaks</li>
<li>GPTZero</li>
<li>AI detector pro</li>
<li>Corrector app</li>
<li>Chatgpt (yes, you can ask it whether it looks like it wrote something).</li>
</ul>
<p>They looked at 40 student reports from a variety of fields. Also both Dutch
and English. And from between January 2020 and June 2022, so before chatgpt
could have been used. For every report, they made three summaries:</p>
<ul class="simple">
<li>One by a human.</li>
<li>One by chatgpt.</li>
<li>Chatgpt, but altered by QuillBot, which should make it look less
recognizable.</li>
</ul>
<p>So: 120 test samples in total. In the end <em>copyleaks</em> performed the best. The
others didn't do well.</p>
</div>
]]>
</content>
</entry>
<entry>
<title>aiGrunn: small and practical AI models for CO2 reduction in buildings - Bram de Wit</title>
<link rel="alternate" type="text/html"
href="http://reinout.vanrees.org/weblog/2023/11/10/6-small-practical-ai-models.html" />
<id>http://reinout.vanrees.org/weblog/2023/11/10/6-small-practical-ai-models.html</id>
<author>
<name>Reinout van Rees</name>
</author>
<published>2023-11-10T00:00:00+01:00</published>
<updated>2023-11-10T13:20:00+01:00</updated>
<category term="aigrunn" />
<category term="ai" />
<category term="python" />
<content type="html"><![CDATA[
<div class="document">
<p>(One of <a class="reference external" href="https://reinout.vanrees.org/weblog/tags/aigrunn.html">my summaries</a> of
the 2023 Dutch <a class="reference external" href="https://aigrunn.org/">aiGrunn</a> AI conference in Groningen, NL).</p>
<p>LLM models can be huge. Mind-boggling huge. But... we can also have fun with
small models.</p>
<p>He works at a company that regulates climate installations in buildings (HVAC, heating,
ventilation, air conditioning) via the cloud. Buildings use 30% of all energy
worldwide. So improving how the HVAC installation is used has a big impact.</p>
<p>A use case: normally you pre-heat rooms so that it is comfy when you
arrive. But sometimes the sun quickly warms the room anyway shortly
afterwards. Can you not conserve some energy without sacrificing too much
comfort?</p>
<p>You <em>could</em> calculate an optimal solution, but you can also "just" measure every
individual room and combine those measurements with an AI.</p>
<p>Technical setup:</p>
<ul class="simple">
<li>An "edge device" inside the building.</li>
<li>An external API.</li>
<li>The API stores the data in mysql (the room metadata) and influxdb (the
timeseries).</li>
<li>A user selects a room and a <em>machine learning model type</em> and a training
data set (from historical data).</li>
<li>The software creates a dataset from influxdb, trains the model
(pytorch). The trained neural network goes to ONNX (open neural network
exchange). The output is stored in minio (S3-compatible object
store). <strong>Note: all this is internal:</strong> no chatgpt or so.</li>
<li>With the business logic these predictions get interpreted and used for
steering the heating. Normally you can achieve 3-5% savings.</li>
<li>The actual steering happens locally in the building with a "go" program that
reads the ONNX data. It is open source and is called... <a class="reference external" href="https://github.com/AdvancedClimateSystems/gonnx">gonnx</a> :-)</li>
</ul>
<p>They have a server with 1 GPU, which is enough for training all those models!</p>
</div>
]]>
</content>
</entry>
<entry>
<title>aiGrunn: learntail, turn anything into a quiz using AI - Arjan Egges</title>
<link rel="alternate" type="text/html"
href="http://reinout.vanrees.org/weblog/2023/11/10/5-quiz.html" />
<id>http://reinout.vanrees.org/weblog/2023/11/10/5-quiz.html</id>
<author>
<name>Reinout van Rees</name>
</author>
<published>2023-11-10T00:00:00+01:00</published>
<updated>2023-11-10T12:54:00+01:00</updated>
<category term="aigrunn" />
<category term="ai" />
<category term="python" />
<content type="html"><![CDATA[
<div class="document">
<p>(One of <a class="reference external" href="https://reinout.vanrees.org/weblog/tags/aigrunn.html">my summaries</a> of
the 2023 Dutch <a class="reference external" href="https://aigrunn.org/">aiGrunn</a> AI conference in Groningen, NL).</p>
<p><a class="reference external" href="https://www.arjancodes.com/">Arjan</a> is known for <a class="reference external" href="https://www.youtube.com/@ArjanCodes">his programming videos</a>.</p>
<p>Alternative title: "the dark side of integrating a LLM (large language model)
in your software". You run into several challenges. He illustrates it with
<a class="reference external" href="https://www.learntail.com/">https://www.learntail.com/</a> , something he helped build. It creates quizzes from
text to make the reader more active.</p>
<p>What he used was the python library <a class="reference external" href="https://www.langchain.com/">langchain</a>
to connect his app with a LLM. A handy trick: you can have it send extra
format instructions to chatgpt based on a <em>pydantic</em> model. If it works, it
works. But if you don't get proper json back, it crashes.</p>
<p>Some more challenges:</p>
<ul class="simple">
<li>There is a limit on prompt length. If it gets too long, the LLM won't fully
understand it anymore and ignore some of the instructions.</li>
<li>A LLM is no human being. So "hard" or "easy" don't mean anything. You have to
be more machine-explicit, like "quiz without jargon".</li>
<li>The longest answer it provides is often the correct one. Because the data it
has been trained on often has the longest one as the correct answer...</li>
<li>Limits are hard to predict. The token limit is input + output, so you
basically have to know beforehand how many tokens the AI needs for its
output.</li>
<li>Rate limiting is an issue. If you start chunking, for instance.</li>
</ul>
<p>A LLM is <em>not</em> a proper API.</p>
<ul class="simple">
<li>You need to do syntax checking on the answer.</li>
<li>Are all the fields present? Validation.</li>
<li>Are the answers of the right type (float/string/etc).</li>
</ul>
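<p>The three checks above can be sketched with only the standard library. This is
an illustration, not the langchain/pydantic setup from the talk: the field
names in <tt class="docutils literal">REQUIRED_FIELDS</tt> and the <tt class="docutils literal">parse_llm_reply</tt> helper are assumptions
made up for a quiz question.</p>
<pre class="literal-block">
```python
import json

# Hypothetical quiz-question format; the field names are assumptions.
REQUIRED_FIELDS = {"question": str, "answers": list, "correct_index": int}


def parse_llm_reply(reply):
    """Return the validated dict, or raise ValueError with a readable reason."""
    try:
        data = json.loads(reply)  # Syntax check.
    except json.JSONDecodeError as error:
        raise ValueError(f"not valid JSON: {error}") from error
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:  # Are all the fields present?
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):  # Right type?
            raise ValueError(f"wrong type for field: {name}")
    if data["correct_index"] not in range(len(data["answers"])):
        raise ValueError("correct_index points outside the answers list")
    return data


good_reply = '{"question": "What is 1 + 1?", "answers": ["2", "3"], "correct_index": 0}'
print(parse_llm_reply(good_reply)["question"])
```
</pre>
<p>Raising a <tt class="docutils literal">ValueError</tt> with a reason makes it possible to retry the prompt
instead of crashing on the first malformed answer.</p>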
<p>And hey, <em>you can still write code yourself</em>. You don't have to ask the LLM
everything, you can just do the work yourself, too. An open question is
whether developers will start to depend too much on LLMs.</p>
</div>
]]>
</content>
</entry>
<entry>
<title>aiGrunn: fighting cancer with AI - Hylke Donker</title>
<link rel="alternate" type="text/html"
href="http://reinout.vanrees.org/weblog/2023/11/10/4-fighting-cancer.html" />
<id>http://reinout.vanrees.org/weblog/2023/11/10/4-fighting-cancer.html</id>
<author>
<name>Reinout van Rees</name>
</author>
<published>2023-11-10T00:00:00+01:00</published>
<updated>2023-11-10T11:08:00+01:00</updated>
<category term="aigrunn" />
<category term="ai" />
<category term="python" />
<content type="html"><![CDATA[
<div class="document">
<p>(One of <a class="reference external" href="https://reinout.vanrees.org/weblog/tags/aigrunn.html">my summaries</a> of
the 2023 Dutch <a class="reference external" href="https://aigrunn.org/">aiGrunn</a> AI conference in Groningen, NL).</p>
<p>What is cancer? According to wikipedia: <em>abnormal cell growth with the
potential to invade or spread to other parts of the body</em>. That is what you
can observe. Medically, there are several aspects of cancer:</p>
<ul class="simple">
<li>It prevents the cell from dying.</li>
<li>It can grab more than usual resources.</li>
<li>No sensitivity to the regular anti-growth signals.</li>
<li>Etc.</li>
</ul>
<p>AI starts getting used in clinics. For instance for proton therapy: where to
best apply the proton radiation. And in radiology: letting AI look at images
to detect cancer. A good AI can out-perform doctors. Analysis of blood
samples, trying to detect cancer based on the DNA samples in there.</p>
<p>DNA mutations can also be detected, which is what he focuses on. Cancer is
basically a "disease of the genome". DNA is made up of T, C, G and A
sequences. Technically, it is perfectly feasible to "read" DNA.</p>
<p>How do mutations occur? Exposure can leave "scars" in DNA. Damage can occur
due to sunlight or smoking for instance. Specific sources result in specific
kinds of damage: smoking has a "preference" for changing specific
letters. With analysis, you can thus detect/estimate the cause of cancer.</p>
<p>A method to detect it is <em>non-negative matrix factorisation</em>. Normally you can
only summarize the data in "hard" clusters: something is either A or B. With
this technique, you can do "soft" clusters: something can be a little bit A
and a bit more B.</p>
<p>Matrix factorisation is a way to relate separate data sources. For movies, you
can have persons with preferences for comedy or action movies. And movies with
a percentage action/comedy. Combined you get a matrix with estimates for the
preference for every movie per user.</p>
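<p>A small sketch of the "combined" step in the movie example. All names and
numbers are invented. It only shows how two small factor matrices multiply
into a matrix of estimates; actual matrix factorisation works in the opposite
direction and learns both factors from an observed matrix.</p>
<pre class="literal-block">
```python
# All names and numbers below are invented for illustration.
# Per user: how much they like (comedy, action).
user_preferences = {
    "alice": [0.9, 0.1],
    "bob": [0.2, 0.8],
}
# Per movie: the fraction of (comedy, action) it contains.
movie_genres = {
    "slapstick night": [1.0, 0.0],
    "car chase 7": [0.1, 0.9],
}


def estimate(user, movie):
    """Dot product of a user's preferences and a movie's genre mix."""
    pairs = zip(user_preferences[user], movie_genres[movie])
    return sum(preference * share for preference, share in pairs)


# The full matrix of estimates: one value per (user, movie) combination.
estimates = {
    (user, movie): round(estimate(user, movie), 2)
    for user in user_preferences
    for movie in movie_genres
}
for key, value in estimates.items():
    print(key, value)
```
</pre>
<p>In this made-up data, alice scores highest on the comedy and bob on the action
movie, which is what their preferences suggest.</p>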
<p>In a similar way, he creates a matrix relating cancer causes (like smoking) to
specific observed types of DNA damage.</p>
<p>But... how reliable are the results? You can treat the matrix as a neural
network. You can then use <em>bayesian analysis</em> to assess the probabilities.</p>
<p>He made a python package for his research: "mubelnet" (though I couldn't find
that online, btw).</p>
<p>AI is transforming cancer care. The only part it doesn't affect is the actual
nursing process.</p>
</div>
]]>
</content>
</entry>
<entry>
<title>aiGrunn: thinking outside the chat box - JP van Oosten</title>
<link rel="alternate" type="text/html"
href="http://reinout.vanrees.org/weblog/2023/11/10/3-thinking-outside-chatbox.html" />
<id>http://reinout.vanrees.org/weblog/2023/11/10/3-thinking-outside-chatbox.html</id>
<author>
<name>Reinout van Rees</name>
</author>
<published>2023-11-10T00:00:00+01:00</published>
<updated>2023-11-10T10:39:00+01:00</updated>
<category term="aigrunn" />
<category term="ai" />
<category term="python" />
<content type="html"><![CDATA[
<div class="document">
<p>(One of <a class="reference external" href="https://reinout.vanrees.org/weblog/tags/aigrunn.html">my summaries</a> of
the 2023 Dutch <a class="reference external" href="https://aigrunn.org/">aiGrunn</a> AI conference in Groningen, NL).</p>
<p>Getting chatgpt to output valid json can be a chore:</p>
<pre class="literal-block">
> extract xxxx, output as json
> extract xxxx, output as json list
> extract xxxx, output as json with this schema
> extract xxxx, output as json, aargh JSON I BEG YOU
</pre>
<p>Apparently they solved the json problem last monday. But he had the same
problem when trying to get chatgpt to output only English and not Dutch. So
the underlying problem is still there: you have to beg it to output in a
certain way and hope it listens.</p>
<p>Some other problems are <strong>hallucinations</strong>: chatgpt telling you something with
complete confidence, even though being wrong. And <strong>biases</strong>. And it is not
really a chatbot, as it doesn't ask questions. Unparseable output. Lack of
explainability. Privacy issues as you're sending data to servers in the
USA.</p>
<p>And... what are the data sources chatgpt used? We don't know. They're called
"openAI", but they're definitely not open.</p>
<p><strong>When to use LLMs and when not to use them</strong>. Some <em>good</em> use cases:</p>
<ul class="simple">
<li>Zero/few shot learning. A quick way to get a simple minimum viable product
or proof of concept.</li>
<li>Summarizing/transforming.</li>
<li>Data format transformation. html to json for instance.</li>
<li>You can use it to gather training data for easy bootstrapping.</li>
</ul>
<p>Some bad use cases:</p>
<ul class="simple">
<li>Structured classification tasks. You really want proper, neat
output. Especially when you have lots of classes or a big context. For small
personal projects it might be OK, but not for production.</li>
<li>Non-text classification... A large language model of course won't help you
with it.</li>
<li>When costs or energy consumption is important. Scaling is an issue.</li>
<li>When it is unclear who is responsible for what gets outputted. A chatbot
generating "of course, you can get a refund" can be problematic if the
customer really wants the refund it should not get...</li>
<li>When you really want to be sure you get the right answer.</li>
</ul>
<p>What are some ideas you can look at?</p>
<ul class="simple">
<li><tt class="docutils literal">gzip</tt> plus near-neighbor analysis. Compress text and see how similar they
are. It is not perfect, but it is a neat trick.</li>
<li>"Bag of words" plus "random forest" (a classifier available in scikit-learn).</li>
<li>Embeddings and a classifier. A LLM is used to annotate a dataset and you can
then extract the interesting data and work with it.</li>
</ul>
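<p>The <tt class="docutils literal">gzip</tt> idea from the list above can be sketched with the standard
library. The assumption here is that "how similar they are" is measured with
the normalized compression distance, combined with a 1-nearest-neighbour
lookup; the helper names and example texts are made up.</p>
<pre class="literal-block">
```python
import gzip


def compressed_size(text):
    """Number of bytes after gzip compression."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a, b):
    """Normalized compression distance: lower means more similar."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + " " + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def nearest_label(text, labelled_examples):
    """1-nearest-neighbour: return the label of the closest example."""
    closest = min(labelled_examples, key=lambda example: ncd(text, example[0]))
    return closest[1]


# Hypothetical training examples as (text, label) pairs.
examples = [
    ("please refund my order, the package never arrived", "refund"),
    ("how do I reset the password of my account", "account"),
]
print(nearest_label("my package never arrived, I want a refund", examples))
```
</pre>
<p>Texts that share many phrases compress well together, so their distance is
small. On short texts the result is noisy, so treat it as a baseline.</p>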
<p>What he thinks is important: <strong>keep humans in the loop</strong>. Prevent unwanted
consequences. Add a preview step before sending stuff out into the world. Make
classifications visible and allow corrections. Ask the user to label something
if it is unclear. And don't forget to audit the automatic classifications.</p>
<p>When all you have is a LLM, everything might start to look like a generative
task. But don't think like that. Who is going to use it? What is the actual
problem? Spend some time thinking about it.</p>
</div>
]]>
</content>
</entry>
<entry>
<title>aiGrunn: 6 challenges to overcome bringing your LLM app to production - Wijnand Karsens</title>
<link rel="alternate" type="text/html"
href="http://reinout.vanrees.org/weblog/2023/11/10/2-challenges-production.html" />
<id>http://reinout.vanrees.org/weblog/2023/11/10/2-challenges-production.html</id>
<author>
<name>Reinout van Rees</name>
</author>
<published>2023-11-10T00:00:00+01:00</published>
<updated>2023-11-10T09:55:00+01:00</updated>
<category term="aigrunn" />
<category term="ai" />
<content type="html"><![CDATA[
<div class="document">
<p>(One of <a class="reference external" href="https://reinout.vanrees.org/weblog/tags/aigrunn.html">my summaries</a> of
the 2023 Dutch <a class="reference external" href="https://aigrunn.org/">aiGrunn</a> AI conference in Groningen, NL).</p>
<p>Alternative title: <em>five reasons your boss doesn't allow you to work on your
LLM app idea</em>.</p>
<p>Show of hands at the beginning. "Who has never used chatgpt". I think I was
the only one raising my hand :-) Lots of people are interested in
it. According to google search queries, more people are interested in <em>prompt
engineering courses</em> than in programming courses. Working in generative AI is
a great work field at the moment.</p>
<p>Wijnand played a lot with it. He made a linkedin autoresponder, a whatsapp
chatbot, a rap song generator, etc. To become enthusiastic about it he
recommends checking out <a class="reference external" href="https://devday.openai.com/">https://devday.openai.com/</a> .</p>
<p>There are several common drawbacks you can hear from your boss:</p>
<ul class="simple">
<li>"Generative AI doesn't comply with privacy laws". Main reason: data is often
hosted by big USA companies. Well, you can use azure in Europe. There are
Dutch startups like Orquesta that help you pick the right ones. Complying
with the GDPR is possible. You can also use local models.</li>
<li>"AI hallucinates and is unreliable". He thinks it is mostly
solved. <em>Retrieval augmented generation</em> is one of the methods you can look
at. Or <em>prompt chain techniques</em> like manual validation prompts or enforcing
explicit requirements.</li>
<li>"Too expensive". Programmers are expensive and models also. So: look at
smaller, cheaper models: you often don't need the full chatgpt4. Use simpler
prompts. Perhaps create your vectorisation once: then you can run your
prompts practically for free. Oh, and chatgpt4 will drop its price by a
factor of 3.</li>
<li>"The context window is too small". (Chatgpt4 can consume bigger items since
last monday, btw). Chunking/summarizing or vector embedding can also
help. If you want it to write an entire course, you can give it the
initial question and ask it to generate a summary. From the summary a table
of contents and from the TOC the individual chapters.</li>
<li>"Merging genAI with regular tools is hard". You can ask chatgpt to reply
with <tt class="docutils literal">json</tt>. With the json output, you can then even feed it to javascript
functions.</li>
</ul>
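<p>The chunking mentioned for the context window can be sketched like this.
Counting words is only a crude stand-in for counting tokens, and the
<tt class="docutils literal">chunk_words</tt> helper with its default sizes is an assumption for illustration.</p>
<pre class="literal-block">
```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into word chunks; consecutive chunks share `overlap` words."""
    if overlap not in range(max_words):
        raise ValueError("overlap must be at least 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # The last chunk already reaches the end of the text.
    return chunks


sample = " ".join(f"w{number}" for number in range(10))
for chunk in chunk_words(sample, max_words=4, overlap=1):
    print(chunk)
```
</pre>
<p>The overlap keeps a bit of shared context between chunks, so a sentence that
is cut in two still appears whole in one of them more often.</p>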
<p>During the talk, he showed off a project he is working on. A combination of
chatgpt4 and web scraping, switching back between the two of them.</p>
<p>The biggest challenge he sees is to create something that <em>won't</em> be taken
over by OpenAI. So don't compete with it but complement OpenAI. It is very
hard to compete with them as they're moving so quickly...</p>
</div>
]]>
</content>
</entry>
<entry>
<title>aiGrunn: state-of-the-art transformer pipelines in spaCy - Daniël de Kok &amp; Madeeswaran Kannan</title>
<link rel="alternate" type="text/html"
href="http://reinout.vanrees.org/weblog/2023/11/10/10-transformer-pipelines.html" />
<id>http://reinout.vanrees.org/weblog/2023/11/10/10-transformer-pipelines.html</id>
<author>
<name>Reinout van Rees</name>
</author>
<published>2023-11-10T00:00:00+01:00</published>
<updated>2023-11-10T16:19:00+01:00</updated>
<category term="aigrunn" />
<category term="ai" />
<category term="python" />
<content type="html"><![CDATA[
<div class="document">
<p>(One of <a class="reference external" href="https://reinout.vanrees.org/weblog/tags/aigrunn.html">my summaries</a> of
the 2023 Dutch <a class="reference external" href="https://aigrunn.org/">aiGrunn</a> AI conference in Groningen, NL).</p>
<p>The company they work for is called "explosion", so what can go wrong? :-)</p>
<p>SpaCy (<a class="reference external" href="https://spacy.io/">https://spacy.io/</a>) is a library for natural language processing. You
give it text documents and you get them back with annotations.</p>
<p>Spacy mostly works with a pipeline. You always start with a tokenizer,
afterwards multiple optional steps and at the end the annotated document.</p>
<p>A <strong>tokenizer</strong> splits up the text. The period at the end of a sentence
doesn't belong to the last word, for instance, it is a separate
item. "Twitter's" also is "twitter" and "'s". What comes out of the
tokenization process is a <tt class="docutils literal">Doc</tt>, which behaves as a list of
tokens. <tt class="docutils literal">doc[9]</tt> can be <tt class="docutils literal">'s</tt>.</p>
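<p>To get a feel for the two splitting rules mentioned here, a toy tokenizer
based on a regular expression. This is <em>not</em> how spaCy's tokenizer works
internally and it handles far fewer cases; the <tt class="docutils literal">tokenize</tt> helper and the
example sentence are made up.</p>
<pre class="literal-block">
```python
import re

# Order matters: try the clitic "'s" first, then whole words, then a single
# punctuation character. Whitespace matches nothing and is skipped.
TOKEN_PATTERN = re.compile(r"'s|\w+|[^\w\s]")


def tokenize(text):
    """Return a list of token strings."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Twitter's CEO resigned.")
print(tokens)
```
</pre>
<p>The result is a list with <tt class="docutils literal">Twitter</tt>, <tt class="docutils literal">'s</tt>, <tt class="docutils literal">CEO</tt>, <tt class="docutils literal">resigned</tt> and the period as
separate items.</p>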
<p>A useful step: <strong>lemmatisation</strong>. The token <tt class="docutils literal">accepted</tt> is annotated with the
lemma <tt class="docutils literal">accept</tt>. This makes later searching easier. <tt class="docutils literal">directors</tt> has the
lemma <tt class="docutils literal">director</tt>.</p>
<p><strong>Span classification</strong> is entity recognition. A token <tt class="docutils literal">Musk</tt> is recognised
as a "person". The tokens <tt class="docutils literal">25</tt> and <tt class="docutils literal">april</tt> in combination can be a
"date". The recognised entities end up as <tt class="docutils literal">doc.ents[number]</tt>.</p>
<p>You can do <strong>document classification</strong>. Categories like "newswire" or "love
letter" with an attached estimation ("80% chance this is a newswire").</p>
<p>Some of the transformers work with AI. Several kinds of pre-trained data
are available. What they themselves use is the <em>Groningen meaning bank</em> (GMB),
developed by the university of Groningen. More than 10k English texts, mostly
newspaper texts from the public domain. You can also look at
<a class="reference external" href="https://github.com/explosion/curated-transformers">https://github.com/explosion/curated-transformers</a> .</p>
<p>Spacy has its own plugins to provide annotations, but you can also plug in
your own. It is configured through a <tt class="docutils literal">.ini</tt> file. A <strong>project</strong> can be seen
as a sort of "makefile" for running everything. Assets (=remote sources you
want to have downloaded), training data, what has to be run, the config, etc.</p>
<p>They showed a demo of how the whole system works. Looked nice and useful. You
can play with the demo yourself: <a class="reference external" href="https://github.com/explosion/aiGrunn-2023">https://github.com/explosion/aiGrunn-2023</a></p>
<p>Compared to a LLM like chatgpt, at the moment targeted NLP often performs
much better at classification.</p>
</div>
]]>
</content>
</entry>
<entry>
<title>aiGrunn: data versioning for machine learning practitioners - Jonathan Alexander</title>
<link rel="alternate" type="text/html"
href="http://reinout.vanrees.org/weblog/2023/11/10/1-data-versioning.html" />
<id>http://reinout.vanrees.org/weblog/2023/11/10/1-data-versioning.html</id>
<author>
<name>Reinout van Rees</name>
</author>
<published>2023-11-10T00:00:00+01:00</published>
<updated>2023-11-10T09:21:00+01:00</updated>
<category term="aigrunn" />
<category term="ai" />
<content type="html"><![CDATA[
<div class="document">
<p>(One of <a class="reference external" href="https://reinout.vanrees.org/weblog/tags/aigrunn.html">my summaries</a> of
the 2023 Dutch <a class="reference external" href="https://aigrunn.org/">aiGrunn</a> AI conference in Groningen, NL).</p>
<p>"Branches are all you need: data versioning framework for machine learning".</p>
<p>If you work with git and work with binary files, small changes give you a
completely new copy. With a couple of changes, you quickly get a huge
repository. Especially when you're a machine learning practitioner.</p>
<p>A solution could be an object store (like amazon s3). Name directories like
versions, for instance. But quickly it becomes a mess. Oh, and which version in
the object store matches the versioned model parameters in git? Aargh.</p>
<p><strong>What is proper data versioning?</strong> The answer is git. That's the only
solution to keep track of everything. The core is to use branches. The
branches effectively contain <strong>links to files stored in object storage</strong>. There
are tools for it like <a class="reference external" href="https://mlflow.org/">mlflow</a>. You tell <em>mlflow</em> to
upload/download the data, from your config in git. An alternative is <em>git lfs</em>
for large files.</p>
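<p>The "link to a file in object storage" idea can be sketched as a small
pointer file that is committed to git instead of the data itself. This is an
illustration only, not the format that mlflow or git lfs actually use; the
bucket URL and the helper names are hypothetical.</p>
<pre class="literal-block">
```python
import hashlib
import json


def make_pointer(data, bucket="s3://example-bucket"):
    """Return the small JSON text to commit in place of the binary data."""
    digest = hashlib.sha256(data).hexdigest()
    pointer = {
        "sha256": digest,
        "size": len(data),
        # Hypothetical location: the content hash doubles as the object name.
        "url": f"{bucket}/{digest}",
    }
    return json.dumps(pointer, indent=2, sort_keys=True)


def matches(pointer_text, data):
    """Check that downloaded data is the version the pointer refers to."""
    pointer = json.loads(pointer_text)
    return hashlib.sha256(data).hexdigest() == pointer["sha256"]


pointer_text = make_pointer(b"raw sensor data v1")
print(pointer_text)
print(matches(pointer_text, b"raw sensor data v1"))
```
</pre>
<p>Because the pointer is tiny and text-based, git branches and tags can track
which data version belongs to which code without storing the data itself.</p>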
<ul class="simple">
<li>The <cite>main</cite> branch is for the readme, the documentation, definition of the
business problem, onboarding information. There's no data or code in here.</li>
<li>Data branches. First <cite>raw</cite>. Data first ends up here and is never
deleted. Branches point at specific versions/collections.</li>
<li>Development branches. This is a combination of code and data. But don't
change the data, only the code. Make sure you're only developing in a dev
branch, not in a data branch: you want to keep the two activities separated.</li>
<li>When finished, you can tag what you have.</li>
<li>Stable branches. For (re-)training and running tests.</li>
<li>Analysis branch. Mostly for comparing models, checking algorithms.</li>
</ul>
<p>He has a demo at <a class="reference external" href="https://xethub.com/sdssio/branches-demo">https://xethub.com/sdssio/branches-demo</a> .</p>
</div>
]]>
</content>
</entry>
</feed>