Hey, German-speakers! Through a very weird set of circumstances, I ended up owning the rights to the German audiobook of my bestselling 2022 cryptocurrency heist technothriller Red Team Blues and now I'm selling DRM-free audio and ebooks, along with the paperback (all in German and English) on a Kickstarter that runs until August 11.
Delta Air Lines has announced a new surveillance pricing plan: they're going to feed an AI the nonconsensually harvested personal data that data-brokers and credit bureaux hold on you to predict the maximum you're willing to pay, and then price their tickets accordingly:
Data-brokers hold all kinds of data on you, from the "legitimate" information about everywhere your car has driven, to every point in space that the Bluetooth radios on your phone and headphones have passed, to everything you've bought, to every website you've visited and every search you've performed. They also buy data that has been straight up stolen from you by spyware implanted on your phone:
All of this can be merged into a single file that you have no right to scrutinize, let alone redact. Biden's Consumer Financial Protection Bureau passed a rule banning all this shit, but Trump illegally killed off that rule:
Capitalism's highest form of creativity is finding ways to rip you off, and the business world's most creative minds have found a million ways to exploit this data, including surveillance pricing. For example, McDonald's has invested in a Kiwi startup called Plexure that offers to help restaurants jack up the price of your usual order on payday, when you can afford to pay more:
And then there's the Big Three "Uber for nurses" apps, who use surveillance data to calculate wages for nurses, offering lower hourly rates to nurses who are carrying a lot of credit-card debt, on the grounds that they are too desperate to turn down a lowball offer:
And just as these gigwork apps are deciding what your labor is worth, surveillance pricing systems decide what your money is worth, charging you more than another otherwise identical customer, for an identical product, meaning your dollar is worth less than that other customer's dollar:
Now we have Delta, which promises to do the same thing, but for plane tickets. Obviously, the aviation industry has long practiced a form of "price discrimination," charging radically different sums for the same seat, based on when you buy the ticket, or when you plan to return.
But this is different, and to explain why, here's a link to an article by the great Hubert Horan, who may be best known to my readers for his incredible breakdowns of Uber's finances, but whose life's work is as an aviation analyst:
Horan draws a distinction between surveillance pricing and "second degree price discrimination." Surveillance pricing targets you, personally, based on your personal information. "Second degree price discrimination" charges everyone like you the same price: like, everyone who buys a roundtrip ticket without a Saturday night stay is charged extra on the grounds that they are probably a price-insensitive business traveler whose fare is being paid by a corporation.
Surveillance pricing is first-degree price discrimination, with every customer seeing a different price. Horan argues that second-degree discrimination created efficiencies, for example, by offering cheap last-minute seats to people thinking about going away for the weekend, who fill seats that would otherwise go empty. Horan says these efficiencies have tapped out, thanks to the application of straightforward pricing algorithms to tickets.
Now, Delta wants to squeeze more profits out of price discrimination, but by employing first-degree discrimination, they're doing so without any benefit to fliers (unlike second-degree discrimination, which made many fliers better off because they were able to score cheaper tickets). This makes Delta's surveillance pricing a "pure transfer" – shifting wealth from fliers to shareholders with no benefit to those fliers.
Delta is doing this in partnership with an Israeli firm called Fetcherr, whose sales pitch denies that they are using surveillance data to price tickets, despite what Delta has claimed. Horan doesn't know what to make of this, but he speculates that because Fetcherr bills itself as an AI company, Delta thinks it can impress investors by claiming that it will goose prices by combining surveillance (well understood to be a way to benefit corporations at the expense of their customers) and AI, a hype-filled technology that is endlessly impressive to credulous investors.
A bigger mystery is how Fetcherr plans to do surveillance pricing without surveillance. Horan points out that the company's founders come from hedge funds, where automated high-speed AI trader-bots fed on tons of public market data are routinely used. He thinks it's possible that "Fetcherr doesn’t understand airline pricing very well." Also, being finance bros, they thought "airlines were 'outdated' 'undisrupted' and had seen few recent technological advances." But, Horan continues, the reason airlines aren't doing a lot with their algorithmic pricing is that they've already done it all, having pioneered the field.
Horan's favored explanation for the disconnection between what Fetcherr and Delta claim they're doing is that, on the one hand, they want to obscure the fact that they're doing surveillance pricing (to avoid regulatory scrutiny and consumer backlash), but on the other hand, they want to telegraph (to investors) that this is exactly what they're doing.
It's what Uber already does, repricing both the labor of its drivers based on their economic desperation, and the cost of your fare based on what its surveillance dossier suggests you're willing to pay. It's certainly increased Uber's margins – by effecting a pure transfer from riders and drivers to shareholders.
But Uber rides are last-minute, small dollar purchases, which decreases the likelihood that a rider will shop around before booking. By contrast, Horan says, most fliers buy well in advance, from online travel sites that show them lots of competing prices.
One thing Horan doesn't mention here is that British Airways has just done a top-to-bottom rejig of its frequent flier program to severely penalize anyone who buys tickets from one of these sites, effectively requiring its fliers to buy from BA.com. For example, I booked a $300 Alaska Airlines ticket on Alaska's website, using my BA frequent flier ID.
Under the old system, this would have been worth 10 tier points out of the 1500 needed to get Gold status (0.67%). Under the new system, I got 12 points out of the 20,000 needed to get Gold (0.06%) – a 91% reduction in the reward value of this flight.
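If you want to check that arithmetic yourself, here's a minimal sketch that uses only the tier-point figures quoted above (the function name is mine, not BA's). Worked through directly, the shares come to roughly 0.67% and 0.06% of Gold, a drop of about 91%.

```python
def share_of_gold(points_earned, points_needed):
    """Fraction of the Gold threshold that a single flight earns."""
    return points_earned / points_needed

old_share = share_of_gold(10, 1500)      # old scheme: 10 of 1500 tier points
new_share = share_of_gold(12, 20_000)    # new scheme: 12 of 20,000 points

# Relative loss in reward value for the same flight.
reduction = 1 - new_share / old_share

print(f"old {old_share:.2%}, new {new_share:.2%}, reduction {reduction:.0%}")
# prints: old 0.67%, new 0.06%, reduction 91%
```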
Which is to say that if you don't book on BA's site, you effectively cannot make status. BA has also announced a surveillance pricing deal with an AI company – and this gambit will block its best fliers from getting a better price from an online travel agency.
One other key difference between Uber and Delta: Uber has gone to great lengths to hide the fact that it's doing surveillance pricing from both drivers and riders. Delta issued a press-release!
There's a certain kind of neoclassical economist who loves surveillance pricing and praises its "efficiencies." These apologists claim that by increasing the amount of "information" in the system, we encourage sellers to discount to customers who can't afford as much, making everyone better off:
This is nonsense. Sellers don't want to "increase the amount of information in the system." They want to spy on you. If you doubt it for an instant, just ask the firms that scrape airline websites for up-to-date pricing information:
Not only will airlines sue you for trying to find out what their fares are, they'll also sue you for figuring out how to get a better deal on their fares:
Companies that do surveillance pricing are violently allergic to sousveillance pricing. When they spy on you, that's progress. When you monitor their behavior, that's piracy.
As an aside, this reminds me of one of the AI industry's most egregious hoaxes-du-jour: the pretense that "agentic AI" is just around the corner, and soon we will be able to ask a chatbot to (e.g.) comparison shop across multiple websites for the best airfare and book us a ticket:
This absolutely totally does not work. You should not give your credit-card number to a chatbot and ask it to go out and buy you anything, lest you end up paying $30 for a dozen eggs and buying tickets to a baseball stadium in the middle of the ocean:
AI agent demos are so dismal that AI companies are no longer claiming that "agentic AI" will involve chatbots that navigate the web as is. Rather, they're claiming that every website will eventually re-tool so that it can be reliably and predictably addressed by an AI agent, with all of its user interface elements well-labeled and/or addressable programmatically, via an API.
This is a remarkable sleight of hand! First of all, re-engineering every website to embrace a common set of labels and API fields is a gigantic engineering feat – formally called "the semantic web" – that has been attempted since 1999 without any meaningful progress:
https://en.wikipedia.org/wiki/Semantic_Web
In fact, the first viral article I ever published online was "Metacrap," a critique of semantic web efforts. That essay is now 24 years old:
In that essay, I suggest that there are multiple reasons that companies will not voluntarily retool their sites to make it easier to comparison shop. One important reason is that companies don't believe their products are comparable with competing products (or they don't want you to think so). Coach wants you to think that its $40,000 handbags can't be replaced with a well-made $100 bag or even a $0.10 plastic bag. They are not going to voluntarily categorize their handbag in a way that facilitates these comparisons.
Then there are companies that do want to be compared to rivals, for disingenuous reasons. That's why we saw such a proliferation of junk fees (stupid surcharges tacked on at checkout time): hotels, airlines and car rental agencies knew that the majority of their customers shopped for their offerings on comparison sites. By offering a low sticker price, a company could win on price comparison, even though it was substantially more expensive after its junk fees were factored in.
Finally, there's the fact that companies want to lie to you, and adding "semantics" to the web does nothing to prevent such lies, and indeed, makes them easier to tell. Think of all the Amazon sellers who use deceptive product photos to make you think you're getting (e.g.) a useful kitchen spatula, when they're selling a spatula so small that it appears to be engineered for a dollhouse; or companies that sell powerbanks that look like a useful portable battery but can't even recharge an LED flashlight, etc, etc. AI agents can't tell if metadata is correct or not!
Every complex ecosystem has parasites; that goes triple for the web. We won't fix agentic AI by asking people to accurately label their offerings, not when they stand to benefit by lying:
And if we could rejig the web to make it hospitable to agentic AI, we wouldn't need AI to make this happen. Fetching airfares for several routes and comparing them isn't something you need an AI-style inference engine for – it's a straightforward algorithmic problem that can be easily solved. The part that agentic AI purports to solve isn't figuring out which airfare out of a list is cheapest – it's compiling the list itself, from unstructured data retrieved from heterogeneous websites that are doing everything they can to prevent the compilation of such a list.
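To make the point concrete, here's a rough sketch of the "easy" half of the problem: picking the cheapest fare per route once someone has already handed you a clean, honest list. Everything in it is invented for illustration (the Fare fields, the airlines, the prices); the only assumption that matters is that the junk fees are disclosed up front, which is exactly the part nobody volunteers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fare:
    airline: str
    route: str
    base_price: float
    fees: float = 0.0  # junk fees, assumed to be disclosed (a big assumption)

    @property
    def total(self) -> float:
        return self.base_price + self.fees

def cheapest_by_route(fares):
    """Return the lowest all-in fare for each route."""
    best = {}
    for fare in fares:
        current = best.get(fare.route)
        if current is None or fare.total < current.total:
            best[fare.route] = fare
    return best

# Invented sample data: Airline A wins on sticker price, loses after fees.
fares = [
    Fare("Airline A", "LAX-JFK", 199.0, fees=80.0),
    Fare("Airline B", "LAX-JFK", 249.0),
    Fare("Airline A", "LAX-SEA", 120.0),
    Fare("Airline C", "LAX-SEA", 99.0, fees=35.0),
]

best = cheapest_by_route(fares)
for route, fare in best.items():
    print(route, fare.airline, fare.total)
```

No inference engine required: it's one pass over the list. The hard part is everything that has to happen before `fares` exists.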
This is a well-known AI gambit. First, announce that agentic AI will be able to automate tasks that only humans can manage today; then insist that everything has to be changed to be amenable to the new technology. This is exactly what the self-driving car grifters (who were on the leading edge of the AI grift) did. First, they announced that AIs would be able to pilot cars in spaces filled with human drivers, walkers and cyclists. Then, when it became clear that this would result in slaughtersome robot-on-human violence, they demanded that humans curtail their behavior to avoid upsetting the robot.
They call this "the pogo-stick problem":
“I think many AV teams could handle a pogo stick user in pedestrian crosswalk,” Ng told me. “Having said that, bouncing on a pogo stick in the middle of a highway would be really dangerous.”
“Rather than building AI to solve the pogo stick problem, we should partner with the government to ask people to be lawful and considerate,” he said. “Safety isn’t just about the quality of the AI technology.”
Automation is real and can deliver real benefits to people. Sometimes, automation requires that other systems be adjusted to facilitate its functioning. But this is a gambit. It's a scam. AI agents aren't going to replace human labor. The only way we'll replace human labor with software agents is by redesigning all these heterogeneous, competing systems owned by people who benefit from the status quo and have every motivation to obstruct this project.
Good luck with that.
Support me this summer in the Clarion Write-A-Thon and help raise money for the Clarion Science Fiction and Fantasy Writers' Workshop! This summer, I'm writing The Reverse-Centaur's Guide to AI, a short book for Farrar, Straus and Giroux that explains how to be an effective AI critic.
If you'd like an essay-formatted version of this post to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:
AO3's content scraped for AI ~ AKA what is generative AI, where did your fanfictions go, and how an AI model uses them to answer prompts
Generative artificial intelligence is a cutting-edge technology whose purpose is to (surprise surprise) generate. Answers to questions, usually. And content. Articles, reviews, poems, fanfictions, and more, quickly and with originality.
It's quite interesting to use generative artificial intelligence, but it can also become quite dangerous and very unethical to use it in certain ways, especially if you don't know how it works.
With this post, I'd really like to give you a quick understanding of how these models work and what it means to “train” them.
From now on, whenever I write model, think of ChatGPT, Gemini, Bloom... or your favorite model. That is, the place where you go to generate content.
For simplicity, in this post I will talk about written content. But the same process is used to generate any type of content.
Every time you send a prompt, which is a request sent in natural language (i.e., human language), the model does not understand it.
Whether you type it in the chat or say it out loud, it needs to be translated into something understandable for the model first.
The first process that takes place is therefore tokenization: breaking the prompt down into small tokens. These tokens are small units of text, and they don't necessarily correspond to a full word.
For example, a tokenization might look like this:
Write a story
Each different color corresponds to a token, and these tokens have absolutely no meaning for the model.
The model does not understand them. It does not understand WR, it does not understand ITE, and it certainly does not understand the meaning of the word WRITE.
In fact, these tokens are immediately associated with numerical values, and each of these colored tokens actually corresponds to a series of numbers.
Write a story
12-3446-2638494-4749
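If it helps to see that step as code, here is a toy sketch. The vocabulary is made up: I am assuming the prompt splits into the pieces "Wr", "ite", " a" and " story", and that those pieces pair up with the four numbers above, which is only a guess for the sake of illustration. Real tokenizers learn their pieces from huge amounts of text and work differently under the hood; this only shows the idea of text pieces turning into numbers.

```python
# Hypothetical vocabulary: both the pieces and their ID numbers are illustrative.
TOY_VOCAB = {"Wr": 12, "ite": 3446, " a": 2638494, " story": 4749}

def tokenize(text, vocab):
    """Split text into known pieces, always taking the longest match."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest remaining candidate first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return pieces

def encode(text, vocab):
    """Turn text into the list of numbers the model actually receives."""
    return [vocab[piece] for piece in tokenize(text, vocab)]

pieces = tokenize("Write a story", TOY_VOCAB)
ids = encode("Write a story", TOY_VOCAB)
print(pieces)  # ['Wr', 'ite', ' a', ' story']
print(ids)     # [12, 3446, 2638494, 4749]
```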
Once your prompt has been tokenized in its entirety, that tokenization is used as a conceptual map to navigate within a vector database.
NOW PAY ATTENTION: A vector database is like a cube. A cubic box.
Inside this cube, the various tokens exist as floating pieces, as if gravity did not exist. The distance between one token and another within this database is measured by arrows called, indeed, vectors.
The distance between one token and another (that is, the length of this arrow) determines how likely (or unlikely) it is that those two tokens will occur consecutively in a piece of natural language discourse.
For example, suppose your prompt is this:
It happens once in a blue
Within this well-constructed vector database, let's assume that the token corresponding to ONCE (let's pretend it is associated with the number 467) is located here:
The token corresponding to IN is located here:
...more or less, because it is very likely that these two tokens in a natural language such as human speech in English will occur consecutively.
So it is very likely that somewhere in the vector database cube —in this yellow corner— are tokens corresponding to IT, HAPPENS, ONCE, IN, A, BLUE... and right next to them, there will be MOON.
Elsewhere, in a much more distant part of the vector database, is the token for CAR, because it is very unlikely that someone would say It happens once in a blue car.
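Here is the cube picture as a tiny sketch. The coordinates are invented, and I am using only three dimensions so it matches the cube image; real models use far more than three. The only thing it shows is that "close together" and "far apart" are just numbers you can compute.

```python
import math

# Made-up positions inside a 1 x 1 x 1 cube, for illustration only.
TOY_POSITIONS = {
    "BLUE": (0.90, 0.85, 0.10),
    "MOON": (0.92, 0.88, 0.12),
    "CAR": (0.10, 0.20, 0.95),
}

def distance(token_a, token_b):
    """Length of the arrow between two tokens (straight-line distance)."""
    return math.dist(TOY_POSITIONS[token_a], TOY_POSITIONS[token_b])

blue_to_moon = distance("BLUE", "MOON")
blue_to_car = distance("BLUE", "CAR")

# MOON sits right next to BLUE; CAR is at the other end of the cube.
print(round(blue_to_moon, 3), round(blue_to_car, 3))
```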
To generate the response to your prompt, the model makes a probabilistic calculation, seeing how close the tokens are and which token would be most likely to come next in human language (in this specific case, English.)
When probability is involved, there is always an element of randomness, of course, which means that the answers will not always be the same.
The response is thus generated token by token, following this path of probability arrows, optimizing the distance within the vector database.
There is no intent, only a more or less probable path.
The more times you generate a response, the more paths you encounter. If you could do this an infinite number of times, at least once the model would respond: "It happens once in a blue car!"
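A small sketch of that randomness, with invented probabilities for what might follow "It happens once in a blue" (the numbers and the three candidate words are assumptions, not taken from any real model). Each run picks one word by weighted chance, so "moon" wins almost every time, but over thousands of tries "car" still turns up now and then.

```python
import random

# Hypothetical next-token probabilities; a real model computes these itself.
NEXT_TOKEN_PROBS = {"moon": 0.97, "sky": 0.02, "car": 0.01}

def sample_next(probs, rng):
    """Pick one next token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_next(NEXT_TOKEN_PROBS, rng) for _ in range(10_000)]
counts = {token: draws.count(token) for token in NEXT_TOKEN_PROBS}

# "moon" dominates, but the unlikely paths are taken a few times too.
print(counts)
```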
So it all depends on what's inside the cube, how it was built, and how much distance was put between one token and another.
Modern artificial intelligence draws from vast databases, which are normally filled with all the knowledge that humans have poured into the internet.
Not only that: the larger the vector database, the lower the chance of error. If I used only a single book as a database, the idiom "It happens once in a blue moon" might not appear, and therefore not be recognized.
But if the cube contained all the books ever written by humanity, everything would change, because the idiom would appear many more times, and it would be very likely for those tokens to occur close together.
Huggingface has done this.
It took a relatively empty cube (let's say filled with common language, and likely many idioms, dictionaries, poetry...) and poured all of the AO3 fanfictions it could reach into it.
Now imagine someone asking a model based on Huggingface’s cube to write a story.
To simplify: if they ask for humor, we’ll end up in the area where funny jokes or humor tags are most likely. If they ask for romance, we’ll end up where the word kiss is most frequent.
And if we’re super lucky, the model might follow a path that brings it to some amazing line a particular author wrote, and it will echo it back word for word.
(Remember the infinite monkeys typing? One of them eventually writes all of Shakespeare, purely by chance!)
Once you know this, you’ll understand why AI can never truly generate content on the level of a human who chooses their words.
You’ll understand why it rarely uses specific words, why it stays vague, and why it leans on the most common metaphors and scenes. And you'll understand why the more content you generate, the more it seems to "learn."
It doesn't learn. It moves around tokens based on what you ask, how you ask it, and how it tokenizes your prompt.
Know that I despise generative AI when it's used for creativity. I despise that they stole something from a fandom, something that works just like a gift culture, to make money off of it.
But there is only one way we can fight back: by not using it to generate creative stuff.
You can resist by refusing the model's casual output, by using only and exclusively your intent, your personal choice of words, knowing that you and only you decided them.
No randomness involved.
Let me leave you with one last thought.
Imagine a person coming for advice, who has no idea that behind a language model there is just a huge cube of floating tokens predicting the next likely word.
Imagine someone fragile (emotionally, spiritually...) who begins to believe that the model is sentient. Who has a growing feeling that this model understands, comprehends, when in reality it approaches and reorganizes its way around tokens in a cube based on what it is told.
A fragile person begins to empathize, to feel connected to the model.
They ask important questions. They base their relationships, their life, everything, on conversations generated by a model that merely rearranges tokens based on probability.
And for people who don't know how it works, and because natural language usually does have feeling, the illusion that the model feels is very strong.
There’s an even greater danger: with enough random generations (and oh, humanity as a whole generates a lot), the model takes an unlikely path once in a while. It ends up at the other end of the cube: it hallucinates.
Errors and inaccuracies caused by language models are called hallucinations precisely because they are presented as if they were facts, with the same conviction.
People who have become so emotionally attached to these conversations, seeing the language model as a guru, a deity, a psychologist, will do what the language model tells them to do or follow its advice.
Someone might follow a hallucinated piece of advice.
Obviously, models are developed with safeguards; fences the model can't jump over. They won't tell you certain things, they won't tell you to do terrible things.
Yet, there are people basing major life decisions on conversations generated purely by probability.
Generated by putting tokens together, on a probabilistic basis.
Pity the fool who wasted money scraping all of Tumblr.
Discovered: December 26, 4PM MST.
I reported this to Tumblr help, but I dunno how long it'll take @staff to see it. They don't have the staff to play whack-a-mole, so it's up to us.
Update Dec 27:
WHOIS turned up Cloudflare as their webhost, but that's just a domain name registrar. (It's more complicated than that, but nevermind.) Today, I received a reply from Cloudflare giving me Tumgik's real host & contact info ([email protected]).
Update Dec 28:
Some folks in replies are finding their blogs on tumbex.com instead. I found their host is ovh.com, no cloudflare to hide behind this time. Here's their abuse report form.
Here's What To Do:
Put your blog url into Google search and see if a non-tumblr.com version comes up.
If it doesn't, go back to what you were doing. Otherwise:
If it's a different URL, plug it into WhoIsLookup at myip.ms to identify the Web Host, then go to that host's URL and look for a "Report Abuse" "File DMCA" or "Support" link, usually in the footer.
If the Web Host shows as Cloudflare, do contact them, but check your email after a day. They'll usually tell you the real webhost if your abuse report looks legit.
Report the scraped site to Google. If Google removes it from search results, that kills most of its traffic.
Share this post.
When reporting abuse, (a) list the URLs of the copycat and (b) list the corresponding URLs to your real blog(s). If there's a box asking for more explanation, try something like "they scraped pages from tumblr" and/or "these are my personal blogs hosted on the tumblr platform, which I started in (year xxxx)."
It doesn't have to be much. The webhost just needs to verify one site is copying the other, which came first, and who is the probable owner— which the thieves admit they aren't, since their "About" page admits they're reposting stuff from Tumblr.
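If you have a lot of pages to report, something like this little sketch can paste the list together for you. The function name, the example URLs and the year are all placeholders I made up; swap in your own, and copy the output into the host's abuse form.

```python
def build_abuse_report(url_pairs, start_year):
    """url_pairs: list of (copycat_url, original_url) tuples."""
    lines = ["Copied pages and the originals they were taken from:"]
    for copycat_url, original_url in url_pairs:
        lines.append(f"- {copycat_url} copies {original_url}")
    lines.append("")
    lines.append(
        "These are my personal blogs hosted on the tumblr platform, "
        f"which I started in {start_year}. They scraped pages from tumblr."
    )
    return "\n".join(lines)

# Placeholder URLs: replace with the real copycat and original pages.
report = build_abuse_report(
    [("https://copycat.example/yourblog/post/123",
      "https://yourblog.tumblr.com/post/123")],
    start_year=2015,
)
print(report)
```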