@vrighter@ylai
That is a really bad analogy. If the "compilation" takes 6 months on a farm of 1000 GPUs and the results are random, then the dataset is basically worthless compared to the model. Datasets are easily available, always were, but if someone invests the effort in the training, then they don't want to let others use the model as open-source. Which is why we want open-source models. But not "openwashed" ones, where they call it "open" but mean non-commercial, no modifications, no redistribution.
I think technically, the source should be the native format of whatever image manipulation program you use. For vector graphics there is the SVG format, but the editor’s native format is still preferable. Otherwise, whoever gets the end copy cannot easily modify or reproduce it, only copy it. But of course it depends on the definition of “easy” and a lot of other factors. Licensing is hard, not least because I am not a lawyer.
What counts as source, and what doesn’t, would depend on the format.
You can create a picture by hand, using no input data.
I challenge you to do the same for model weights. If you truly just sit down and type away numbers in a file, then yes, the model would have no further source. But that is not something that can be done in practice.
Are you sure that you can reproduce the model, given the same inputs? Reproducibility is a difficult property to achieve. I wouldn’t think LLMs are reproducible.
In theory, if you have the inputs, you have reproducible outputs, modulo perhaps some small deviations due to non-deterministic parallelism. But if those effects are large enough to make your model perform differently you already have big issues, no different than if a piece of software performs differently each time it is compiled.
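The “small deviations due to non-deterministic parallelism” mentioned above ultimately come down to floating-point arithmetic not being associative: a parallel reduction that sums gradients in a different order each run produces slightly different results. A minimal sketch of that effect (my own illustration, not from this thread):

```python
# Sketch: floating-point addition is not associative, so summing the same
# values in a different order gives a slightly different result. This is
# the root cause of the "small deviations" from non-deterministic parallelism.
import random

random.seed(42)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(values)            # one summation order
shuffled = values[:]
random.shuffle(shuffled)
reordered = sum(shuffled)        # the same values, a different order

# The two sums agree only up to rounding error (typically tiny, ~1e-12 here),
# which is harmless unless the training loop amplifies it.
print(abs(forward - reordered))
```

In real training the reordering comes from non-deterministic GPU kernel scheduling rather than an explicit shuffle, but the arithmetic cause is the same.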
I would consider the “source code” for artwork to be the project file, with all of the layers intact and whatnot. The Photoshop PSD, the GIMP XCF or the Krita KRA. The “compiled” version would be the exported PNG/JPG.
You can license a compiled binary under CC BY if you want. That would allow users to freely decompile/disassemble it or to bundle the binary for their purposes, but it’s different from releasing source code. It’s closed source, but under a free license.
The situation is somewhat different and more nuanced. With weights there are tools for fine-tuning, LoRA/LoHa, PEFT, etc., which is a different situation than with binaries for programs. You can see that even though e.g. LLaMA is “compiled”, others can build on it to make models that surpass the previous iteration (see e.g. recently WizardLM 2 in relation to LLaMA 2). Weights are also to a much larger degree architecture-independent than binaries (you can usually cross train/inference on GPU, Google TPU, Cerebras WSE, etc. with the same weights).
How is that different from e.g. patching a closed-source binary? There are plenty of community patches to old games to e.g. make them work on newer hardware. Architectural independence seems irrelevant; it’s no different than e.g. Java bytecode.
This is a very shallow analogy. Fine-tuning is rather the standard technical approach to reduce compute, even if you have access to the code and all training data. Hence there has always been a rich and established ecosystem for fine-tuning, regardless of “source.” Patching closed-source binaries is not the standard approach, since compilation is far less computationally intensive than today’s large-scale training.
Java bytecode is a far-fetched example. The JVM assumes a specific architecture, particular to the CPU-dominant world in which it was developed, and Java bytecode cannot be trivially executed (efficiently) on a GPU or FPGA, for instance.
And by the way, the issue of weight portability is far more relevant than the forced comparison to (simple) code can accommodate. Today’s large-scale training code is usually very specific to a particular cluster (or TPU, WSE), as opposed to the resulting weights. Even if you got hold of somebody’s training code, you would often have to reinvent the wheel to scale it to your own particular compute hardware, interconnect, I/O pipeline, etc. This is not commodity open source on your home PC or workstation.
The analogy works perfectly well. It does not matter how common it is. Patching binaries is very hard compared to e.g. LoRA, but it is still essentially the same thing: making a derivative work by modifying parts of the original.
How does this analogy work at all? LoRA is chosen by the modifier to be low-rank to accommodate some desktop/workstation memory constraint, not because the other weights are “very hard” to modify if you happen to have the necessary compute and I/O. The development of LoRA has also been largely directed by storage reduction (hence not too many layers modified) and preservation of generalizability (since training generalizable models is hard). The Kronecker product versions, in particular, were first developed in the context of federated learning, not desktop/workstation fine-tuning (also, LoRA is fully capable of modifying all weights; it is rather a technique to do so in a correlated fashion to reduce the size of the gradient update). And much of LoRA’s development happened in the context of otherwise fully open datasets (e.g. LAION) that are just not manageable in desktop/workstation settings.
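For readers unfamiliar with the technique being argued about: LoRA replaces a full update to a weight matrix W with the product of two small trainable factors. A minimal numpy sketch of the idea, with toy dimensions of my own choosing (not from the thread):

```python
# Sketch of the LoRA idea: instead of updating a full d x k weight matrix W,
# train two small matrices A (r x k) and B (d x r), with r << d, k, and apply
#     W' = W + (alpha / r) * B @ A
import numpy as np

d, k, r, alpha = 64, 64, 4, 8           # toy sizes; r is the chosen low rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))         # frozen "pretrained" weight
A = rng.standard_normal((r, k)) * 0.01  # trainable low-rank factor
B = rng.standard_normal((d, r)) * 0.01  # in real LoRA, B starts at zero;
                                        # nonzero here to show the update's rank

delta = (alpha / r) * B @ A             # the low-rank update
W_adapted = W + delta

# The update touches every entry of W (so "all weights" can change), but it
# has rank at most r, and stores r*(d+k) parameters instead of d*k.
print(np.linalg.matrix_rank(delta), A.size + B.size, W.size)
```

This illustrates the point made above: the low rank is a size/storage trade-off chosen by the modifier, not a limitation on which weights can be touched.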
This narrow perspective of “source” is taking away the actual usefulness of compute/training here. Datasets from e.g. LAION to Common Crawl have been available for some time, along with training code (sometimes independently reproduced) for the Imagen diffusion model or GPT. It is only when e.g. GPT-J came along that somebody invested into the compute (including how to scale it to their specific cluster) that the result became useful.
As usual, they are doing malicious compliance, like when they pretended that iOS and iPadOS were two completely separate operating systems, so iPadOS shouldn’t need to support third-party app stores because the EU said “iOS”.
Probably because there aren’t any, they can’t specifically say “iOS”.
I’m not aware of any other operating system (except the ones in game consoles or dedicated hardware) that doesn’t allow the user to install software not approved by the manufacturer.
Actually the more I think about it the more it seems like the only, legally fair decision. Either all of them are demanded to allow alternative app stores or none of them are. Why should the consoles be any different in this regard? 🤔
Signal is fully open source! You can run it on-premises, if you know what you’re doing!
Why are we not talking about it?
Unless something has drastically changed recently, the official Signal service won’t interoperate with anyone else’s instance. That makes its source code practically useless for general-purpose messaging, which might explain why few are talking about it.
My point is that you have all the open source software components needed to run secure communications, on your own premises, for your own users/community in case you are not trusting Signal’s infrastructure.
If you know any other similar alternative with strong encryption open source protocols please let me know! I love learning new things everyday!
on your own premises, for your own users/community in case you are not trusting Signal’s infrastructure.
Yes, that’s an example of data (and infrastructure) sovereignty. It’s good for self-contained groups, but is not general-purpose messaging, since it doesn’t allow communication with anyone outside your group.
If you know any other similar alternative with strong encryption open source protocols please let me know! I love learning new things everyday!
Matrix can do this. It also has support for communicating across different server instances worldwide (both public and private), and actively supports interoperability with other messaging networks, both in the short term through bridges and in the long term through the IETF’s More Instant Messaging Interoperability (MIMI) working group.
XMPP can do on-premise encrypted messaging, too. Technically, it can also support global encrypted messaging with fairly modern features, with the help of carefully selected extensions and server software and clients, although this quickly becomes impractical for general-purpose messaging, mainly because of availability and usability: Managed free servers with the right components are in short supply and often don’t last for long, and the general public doesn’t have the tech skills to do it themselves. (Availability was not a problem when Google and Facebook supported it, but that support ended years ago.) It’s still useful for relatively small groups, though, if you have a skilled admin to maintain the servers and help the users.
I’m always amazed how people come out of the woodwork to defend Signal any time any criticism of it comes up. It’s become a sacred cow that cannot be questioned. Whatever you may think of Telegram should bear zero weight on your views of Signal.
The reality is that developers of Signal have close ties to US security agencies. It’s a centralized app hosted in US and subject to US laws. It’s been forcing people to use their phone numbers to register, and this creates a graph of real world contacts people have. This alone is terrible from security/privacy perspective. It doesn’t have reproducible builds on iOS, which means you have no guarantee regarding what you’re actually running. These are just a handful of things that are publicly known.
And then we know stuff like this happens. NSA suggested using specific numbers for encryption that it knew how to factor quickly. The algorithm itself was secure, but the specific configuration of how the algorithm was implemented allowed for the exploit thehackernews.com/…/nsa-crack-encryption.html
These kinds of backdoors are very difficult to audit for because if you don’t know what to look for then you won’t have any reason to suspect a particular configuration to be malicious. Given the relationship between people working on Signal and US government, this is a real concern.
The same kind of scrutiny people apply to Telegram and other messaging apps should absolutely be applied to Signal as well.
I’d just like to add that you can use a temporary phone number service to sign up to Signal as you only need a phone number to register, not to actually use Signal.
Idk how secure Telegram is, but c’mon, Signal is shady AF. They won’t let F-Droid host it because they want to sign their own keys or some shit, but there is speculation it’s because they can roll out a custom APK to targets that governments want, which is just not possible if it is hosted by someone like F-Droid. Even Telegram allows that, and they even allow third-party apps, which Signal won’t.
SimpleX and Briar are the best options if you’re actually worried about privacy.
This comment is copy-pasted from another thread where I expressed the same opinion.
Signal stans do not have an answer to this. OMEMO is verifiable; the rest of the stack around it is not. Signal even went through a period when they did not update the open-source backend code for over 6 months.
Signal and Telegram are not rivals, though? Signal aims to be an E2EE chat platform, while Telegram works like a public forum in a realtime chat format. Signal/WhatsApp are different from Telegram/Discord. They are not the same type of platform.
Durov is comparing apples and oranges, and anyone falling for this whining, calling Telegram bad is an idiot.
There’s no oversight for any of these agencies, and they have the means and incentive to backdoor cryptography. What would stop them from doing this, morality? There’s no possible way that they both aren’t compromised; all we’re seeing now is them firing pot shots at each other, trying to convince the reader to join their honeypot because it’s sweeter.
I know that Telegram has a lot of users, so I'm not describing all of them here. But I've noticed that it seems especially popular among people who kind of like to "play pretend" as underground hackers. You know, the kind of person who likes to imagine that the government would be after them.
This mudslinging feels like more of a marketing campaign than anything else. An info op that will work well on the Telegram users who like to imagine that they have outmaneuvered all the info ops.